The two major classes of transposable elements are defined by the intermediates in the transposition process. One class moves by DNA intermediates, using transposases and DNA polymerases to catalyze transposition. The other class moves by RNA intermediates, using RNA polymerase, endonucleases and reverse transcriptase to catalyze the process. Both classes are abundant in many species, but some groups of organisms have a preponderance of one or the other. For instance, bacteria have mainly the DNA intermediate class of transposable elements, whereas the predominant transposable elements in mammalian genomes move by RNA intermediates.
- Transposable elements that move via DNA intermediates
- Transposable elements that move via RNA intermediates
Transposable elements that move via DNA intermediates
Among the most thoroughly characterized transposable elements are those that move by DNA intermediates. In bacteria, these are either short insertion sequences or longer transposons.
An insertion sequence, or IS, is a short DNA sequence that moves from one location to another. Insertion sequences were first recognized by the mutations they cause by inserting into bacterial genes. Different insertion sequences range in size from about 800 bp to 2000 bp. The DNA sequence of an IS has inverted repeats (about 10 to 40 bp) at its termini (Figure 9.10.A.). Note that these are different from the flanking direct repeats (FDRs), which are duplications of the target site. The inverted repeats are part of the IS element itself. The sequences of the inverted repeats at each end of the IS are very similar but not necessarily identical. Each family of insertion sequences in a species is named IS followed by a number, e.g. IS1, IS10, etc.
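The relationship between the two termini can be checked computationally: the right terminus, read as its reverse complement, should closely match the left terminus. The following is a minimal sketch in Python; the toy element and the function names are hypothetical and are not taken from any real IS element.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def terminal_inverted_repeat_identity(element: str, repeat_len: int) -> float:
    """Fraction of positions at which the left terminus matches the
    reverse complement of the right terminus."""
    left = element[:repeat_len].upper()
    right_rc = reverse_complement(element[-repeat_len:])
    matches = sum(1 for a, b in zip(left, right_rc) if a == b)
    return matches / repeat_len


# Hypothetical toy element: 10-bp inverted repeats around a made-up interior.
toy_is = "GGTCTAGCAT" + "ACGTTTGACCA" + "ATGCTAGACC"
identity = terminal_inverted_repeat_identity(toy_is, 10)
print(f"terminal inverted repeat identity: {identity:.2f}")
```

Because real inverted repeats are very similar but not necessarily identical, an identity somewhat below 1.0 would still be consistent with an IS terminus.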
An insertion sequence encodes a transposase enzyme that catalyzes the transposition. The amount of transposase is well regulated and is the primary determinant of the rate of transposition.
Transposons are larger transposable elements, ranging in size from 2500 to 21,000 bp. They usually encode a drug resistance gene or other marker in addition to the functions required for transposition (Figure 9.10.B.). One type of transposon, called a composite transposon, has an IS element at each end (Figure 9.10.C.). One or both IS elements may be functional; these encode the transposition function for this class of transposons. The IS elements flank the drug resistance gene (or other selectable marker). It is likely that the composite transposon evolved when two IS elements inserted on either side of a gene. The IS elements at the ends could either move by themselves, or the transposase could recognize the outermost ends of the two closely spaced IS elements and move them together with the DNA between them. If the DNA between the IS elements confers a selective advantage when transposed, then it will become fixed in a population.
What does this model for the formation of a composite transposon predict for the situation in which a transposon is in a small circular replicon, such as a plasmid?
Figure 9.10. General structure of insertion sequences and transposons. Flanking direct repeats (FDRs) are shown as green triangles, inverted repeats (IRs) are red or purple triangles, insertion sequences (ISs) are yellow boxes with red triangles at the end, and other genes are boxes of different colors. The boxes and triangles include both strands of duplex DNA. DNA outside the FDRs is shown as one thick blue line for each strand. Tn5 has an IS50 element on each side, in an inverted orientation. Transcripts are shown as curly lines with an arrowhead pointing in the direction of transcription. The neoR gene for Tn5 is composed partly of the leftward IS (ISL) and partly of other sequences (included in the blue box). The transposase for Tn5 is encoded in the rightward IS (ISR).
The TnA family of transposons has been intensively studied for the mechanism of transposition. Members of the TnA family have terminal inverted repeats, but lack terminal IS elements (Figure 9.10). The tnpA gene of the TnA transposon encodes a transposase, and the tnpR gene encodes a resolvase. TnA also has a selectable marker, ApR, which encodes a beta-lactamase and makes the bacteria resistant to ampicillin.
Transposable elements that move via DNA intermediates are not limited to bacteria, but rather they are found in many species. The P elements in Drosophila are examples of such transposable elements, as are mariner elements in mammals and the controlling elements in plants. The copia family of repeats in Drosophila, in contrast, is an LTR-containing retrotransposon family. Indeed, the general structure of controlling elements in maize is similar to that of bacterial transposons. In particular, they end in inverted repeats and encode a transposase. As illustrated in Figure 9.11, the DNA sequences at the ends of an Ac element are very similar to those of a Ds element. However, in Ds elements the internal regions, which normally encode the transposase, have been deleted. This is why Ds elements cannot transpose by themselves, but rather they require the presence of the intact transposon, Ac, in the cell to provide the transposase. Since transposase works in trans, the Ac element can be anywhere in the genome and still act on Ds elements at a variety of sites. Note that Ac is an autonomous transposon because it provides its own transposase and it has the inverted repeats needed to act as the substrate for transposase.
Figure 9.11. Structure of Ac and Ds controlling elements in maize is similar to that of an intact (Ac) or defective (Ds) transposon.
Mechanism of DNA-mediated transposition
Some families of transposable elements that move via a DNA intermediate do so in a replicative manner. In this case, transposition generates a new copy of the transposable element at the target site, while leaving a copy behind at the original site. A cointegrate structure is formed by fusion of the donor and recipient replicons, which is then resolved (Figure 9.12). Other families use a nonreplicative mechanism. In this case, the original copy excises from the original site and moves to a new target site, leaving the original site vacant.
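The difference in outcome between the two modes can be summarized as simple bookkeeping on the donor and recipient replicons. The sketch below is a deliberately simplified illustration in Python: replicons are modeled as lists of labeled segments, the cointegrate step is skipped, and the function name `transpose` and the segment labels are hypothetical.

```python
def transpose(donor, target, te, site, replicative):
    """Return (new_donor, new_target) after one transposition event.

    donor, target: lists of segment labels; te: label of the element;
    site: index in `target` where the element inserts.
    """
    new_target = target[:site] + [te] + target[site:]
    new_donor = list(donor)
    if not replicative:
        # Nonreplicative: the original copy leaves the donor site vacant.
        new_donor.remove(te)
    return new_donor, new_target


donor = ["a", "TE", "b"]
target = ["x", "y"]

replicative_result = transpose(donor, target, "TE", site=1, replicative=True)
nonreplicative_result = transpose(donor, target, "TE", site=1, replicative=False)
print(replicative_result)
print(nonreplicative_result)
```

In the replicative case both replicons end up with a copy of the element; in the nonreplicative case only the recipient does.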
Figure 9.12: Contrasts between replicative and nonreplicative transposition. The transposable element (TE) is shown as an open arrow. The thick line for each replicon represents double stranded DNA; the different shadings represent different sequences.
Studies of bacterial transposons have shown that replicative transposition and some types of nonreplicative transposition proceed through a strand-transfer intermediate (also known as a crossover structure), in which both the donor and recipient replicons are attached to the transposable element (Figure 9.13). For replicative transposition, DNA synthesis through the strand-transfer intermediate produces a transposable element at both the donor and target sites, forming the cointegrate intermediate. This is subsequently resolved to separate the replicons. DNA synthesis does not occur at the crossover structure in nonreplicative transposition, thus leaving a copy only at the new target site. In an alternative pathway for nonreplicative transposition, the transposon is excised by two double strand breaks, and is joined to the recipient at a staggered break (illustrated at the bottom of Figure 9.12).
In more detail, there are two steps in common for replicative and nonreplicative transposition, generating the strand-transfer intermediate (Figure 9.13).
- The transposase encoded by a transposable element makes four nicks initially. Two nicks are made at the target site, one in each strand, to generate a staggered break with 5' extensions (3' recessed). The other two nicks flank the transposon; one nick is made in one DNA strand at one end of the transposon, and the other nick is made in the other DNA strand at the other end. Since the transposon has inverted repeats at each end, these two nicks that flank the transposon are cleavages in the same sequence. Thus the transposase has a sequence-specific nicking activity. For instance, the transposase from TnA binds to a sequence of about 25 bp located within the 38-bp inverted terminal repeat (Figure 9.10). It nicks a single strand at each end of the transposon, as well as the target site (Figure 9.13). Note that although the target and transposon are shown apart in the two-dimensional drawing in Figure 9.13, they are juxtaposed during transposition.
- At each end of the transposon, the 3' end of one strand of the transposon is joined to the 5' extension of one strand at the target site. This ligation is also catalyzed by transposase. ATP stimulates the reaction but it can occur in the absence of ATP if the substrate is supercoiled. Ligation of the ends of the transposon to the target site generates a strand-transfer intermediate, in which the donor and recipient replicons are now joined by the transposon.
After formation of the strand-transfer intermediate, two different pathways can be followed. For replicative transposition, the 3' ends of each strand of the staggered break (originally at the target site) serve as primers for repair synthesis (Figure 9.13). Replication followed by ligation leads to the formation of the cointegrate structure, which can then be resolved into the separate replicons, each with a copy of the transposon. The resolvase encoded by transposon TnA catalyzes the resolution of the cointegrate structure. The site for resolution (res) is located between the divergently transcribed genes for tnpA and tnpR (Figure 9.10). TnA resolvase also negatively regulates expression of both tnpA and tnpR (itself).
For nonreplicative transposition, the strand-transfer intermediate is released by nicking at the ends of the transposon not initially nicked. Repair synthesis is limited to the gap at the flanking direct repeats, and hence only one copy of the transposon is left. This copy is ligated to the new target site, leaving a vacant site in the donor molecule.
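The staggered break at the target site, followed by repair synthesis across the single-stranded gaps, is what produces the flanking direct repeats on either side of the inserted element. The following Python sketch illustrates only that outcome on the top strand; the sequences, the stagger length and the function name are hypothetical choices for illustration.

```python
def insert_with_staggered_cut(target: str, element: str, cut_pos: int, stagger: int) -> str:
    """Insert `element` into `target` at a staggered break.

    The `stagger` bases starting at `cut_pos` lie between the two nicks.
    After the element is joined and the gaps are filled by repair
    synthesis, those bases appear on both sides of the element as
    flanking direct repeats.
    """
    left = target[:cut_pos]
    duplicated = target[cut_pos:cut_pos + stagger]
    right = target[cut_pos + stagger:]
    return left + duplicated + element + duplicated + right


target = "AAAACCGTAGGGG"  # hypothetical target DNA (top strand)
element = "TTTTTT"        # hypothetical stand-in for the transposon
product = insert_with_staggered_cut(target, element, cut_pos=4, stagger=5)
print(product)
```

The product is longer than the target by the length of the element plus the length of the stagger, which is the size of the target-site duplication.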
Figure 9.13. Mechanism of transposition via a strand-transfer intermediate.
The enzyme transposase can recognize specific DNA sequences, cleave two duplex DNA molecules in four places, and ligate strands from the donor to the recipient. This enzyme has a remarkable ability to generate and manipulate the ends of DNA. A three-dimensional structure for the Tn5 transposase in complex with the ends of the Tn5 DNA has been solved by Rayment and colleagues. One static view of this protein-DNA complex is in Figure 9.14.A. The transposase is a dimer, and each double-stranded DNA molecule (donor and target) is bound by both protein subunits. This orients the transposon ends into the active sites, as shown in the figure. Also, an image with just the DNA (Figure 9.14.B.) shows considerable distortion of the DNA helix at the ends. This structure is a good starting point for better understanding the mechanism of strand cleavage and transfer.
Figure 9.14. Three-dimensional structure of the Tn5 transposase in complex with Tn5 transposon DNA. A. The dimer of the Tn5 transposase is shown bound to a fragment of duplex DNA from the end of the transposon. Alpha helices are green cylinders, beta sheets are yellow-brown flat arrows, and protein loops are blue wires. The DNA is a duplex of two red wires, one for each strand. B. The DNA is shown without the protein and with the nucleotides labeled. The end of the DNA at the top of this panel is oriented into the active site in the middle of the protein in panel A. The structure was determined by Davies DR, Goryshin IY, Reznikoff WS, Rayment I. (2000) “Three-dimensional structure of the Tn5 synaptic complex transposition intermediate.” Science 289:77-85. These images were obtained by downloading the atomic coordinates from the Molecular Modeling Database at NCBI, viewing them with CN3D 3.0 and saving static views as screen shots. The file for observing a virtual three-dimensional image is available at the course website.
Transposable elements that move via RNA intermediates
Transposable DNA sequences that move by an RNA intermediate are called retrotransposons. They are very common in eukaryotic organisms, but some examples have also been found in bacteria. Some retrotransposons have long terminal repeats (LTRs) that regulate expression (Figure 9.15). The LTRs were initially discovered in retroviruses. They have now been seen in some but not all retrotransposons. They have a strong promoter and enhancer, as well as signals for forming the 3’ end of mRNAs after transcription. The presence of the LTR is distinctive for this family, and members are referred to as LTR-containing retrotransposons. Examples include the yeast Ty-1 family and retroviral proviruses in vertebrates. Retroviral proviruses encode a reverse transcriptase and an endonuclease, as well as other proteins, some of which are needed for viral assembly and structure.
Other retrotransposons are in the large and diverse class of non-LTR retrotransposons (Figure 9.15). One of the most prevalent examples is the family of long interspersed repetitive elements, or LINEs. It was initially found in mammals but has now been found in a broad range of phyla, including fungi. The first-discovered and most common LINE family in mammals is the LINE1 family, also called L1. An older family, but discovered later, is called LINE2. Full-length LINEs are about 7000 bp long, and there are about 10,000 copies in humans. Many other copies are truncated from the 5’ ends. Like retroviral proviruses, the full-length L1 encodes a reverse transcriptase and an endonuclease, as well as other proteins. However, the promoter is not an LTR. Other abundant non-LTR retrotransposons, initially discovered in mammals, are short interspersed repetitive elements, or SINEs. These are about 300 bp long. Alu repeats, with over a million copies, comprise the predominant class of SINEs in humans. Non-LTR retrotransposons besides LINEs are found in many other species, such as jockey repeats in Drosophila.
Figure 9.15. Four classes of transposable elements make up the vast majority of human repetitive DNA. From the Nature paper “Initial sequencing and analysis of the human genome,” by the International Human Genome Sequencing Consortium.
Extensive studies of genomic DNA sequences have allowed the reconstruction of the history of transposable elements in humans and other mammals. The major approach has been to classify the various types of repeats (themselves transposable elements), align the sequences and determine how different the members of a family are from each other. Since the vast majority of the repeats are no longer active in transposition, and have no other obvious function, they will accumulate mutations rapidly, at the neutral rate. Thus the sequences of more recently transposed members are more similar to the source sequence than are those of members that transposed earlier. The results of this analysis show that the different families of repeats have propagated in distinct waves through evolution (Figure 9.16). The LINE2 elements were abundant prior to the mammalian divergence, roughly 100 million years ago. Both LINE1 and Alu repeats have propagated more recently in humans. It is likely that the LINE1 elements, which encode a nuclease and a reverse transcriptase, provide functions needed for the transposition and expansion of Alu repeats. LINE1 elements have expanded in all orders of mammals, but each order has a distinctive SINE, all of which are derived from a gene transcribed by RNA polymerase III. This has led to the idea that LINE1 elements provide functions that other, different transcription units use for transposition.
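The dating logic described above can be sketched numerically: measure how far each copy has diverged from the family consensus (taken as a proxy for the source sequence), correct for multiple substitutions at the same site, and divide by a neutral substitution rate. The Python sketch below makes several assumptions that are not from the text: it uses the Jukes-Cantor correction as one common choice, the sequences are made-up and already aligned without gaps, and the rate constant is a hypothetical placeholder rather than a measured value.

```python
import math


def p_distance(copy: str, consensus: str) -> float:
    """Proportion of mismatched sites between two aligned, gap-free sequences."""
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(copy, consensus) if a != b)
    return mismatches / len(consensus)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor estimate of substitutions per site (valid for p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def estimated_age_years(copy: str, consensus: str, rate: float) -> float:
    """Age of a copy, assuming it has diverged from the consensus at `rate`
    substitutions per site per year since it transposed."""
    return jukes_cantor_distance(p_distance(copy, consensus)) / rate


ASSUMED_NEUTRAL_RATE = 2.2e-9  # hypothetical placeholder value, per site per year

consensus = "ACGTACGTACGTACGTACGT"
young_copy = "ACGTACGTACGTACGTACGA"  # 1 difference in 20 sites
old_copy = "ACCTACGAACGTTCGTACGA"    # 4 differences in 20 sites

young_age = estimated_age_years(young_copy, consensus, ASSUMED_NEUTRAL_RATE)
old_age = estimated_age_years(old_copy, consensus, ASSUMED_NEUTRAL_RATE)
print(f"young copy: {young_age:.3g} years; old copy: {old_age:.3g} years")
```

Binning such age estimates across all members of a family is what gives the age distributions of the kind shown in Figure 9.16.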
Figure 9.16. Age distribution of repeats in human and mouse. The LINE2 and MIR repeats propagated before the mammalian radiation, about 100 million years ago, but Alu repeats are formed by recent transpositions in primates (light blue portion of the bar graphs in a and b). The LINE1 and LTR repeats are transposing with about the same frequency as they have historically in the mouse lineage (panels c and d), but few repeats are still transposing in human (panels a and b). From the Nature paper “Initial sequencing and analysis of the human genome,” by the International Human Genome Sequencing Consortium.
Mechanism of retrotransposition
Although the mechanism of retrotransposition is not completely understood, it is clear that at least two enzymatic activities are utilized. One is an integrase, which is an endonuclease that cleaves at the site of integration to generate a staggered break (Figure 9.17). The other is RNA-dependent DNA polymerase, also called reverse transcriptase. These activities are encoded in some autonomous retrotransposons, including both LTR-retrotransposons such as retroviral proviruses and non-LTR-retrotransposons such as LINE1 elements.
The RNA transcript of the transposable element interacts with the site of cleavage at the DNA target site. One strand of DNA at the cleaved integration site serves as the primer for reverse transcriptase. This DNA polymerase then copies the RNA into DNA. That cDNA copy of the retrotransposon must be converted to a double stranded product and inserted at a staggered break at the target site. The enzymes required for joining the reverse transcript (first strand of the new copy) to the other end of the staggered break and for second strand synthesis have not yet been established. Perhaps some cellular DNA repair functions are used.
Figure 9.17. Transposition via an RNA-intermediate in retrotransposons. LINE1, or L1 repeats are shown as an example.
The model shown in Figure 9.17 is consistent with any RNA serving as the template for synthesis of the cDNA from the staggered break. However, LINE1 mRNA is clearly used much more often than other RNAs. The basis for the preference of the retrotransposition machinery for LINE1 mRNA is still being studied. Perhaps the endonuclease and reverse transcriptase stay associated with the mRNA that encodes them after translation has been completed, so that they act in cis with respect to the LINE1 mRNA. Other repeats that have expanded recently, such as Alu repeats in humans, may share sequence determinants with LINE1 mRNA for this cis preference.
Clear evidence that retrotransposons can move via an RNA intermediate came from studies of the yeast Ty-1 elements by Gerald Fink and his colleagues. They placed a particular Ty-1 element, called TyH3, under control of a GAL promoter, so that its transcription (and transposition) could be induced by adding galactose to the medium. They also marked TyH3 with an intron. After inducing transcription of TyH3, additional copies were found at new locations in the yeast strain. When these were examined structurally, it was discovered that the intron had been removed. If the RNA transcript is the intermediate in moving the Ty-1 element, it is subject to splicing and the intron can be removed. Hence, these results fit the prediction of an RNA-mediated transposition. They demonstrate that during transposition, the flow of Ty-1 sequence information is from DNA to RNA to DNA.
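The flow of sequence information in this experiment (DNA to RNA to DNA, with splicing at the RNA stage) can be traced with a toy model. The Python sketch below is only an illustration of the logic; the sequences are made up, and the three functions are hypothetical simplifications that ignore strands, promoters and the enzymes involved.

```python
def transcribe(dna: str) -> str:
    """Toy transcription: copy the sequence into RNA letters."""
    return dna.replace("T", "U")


def splice(rna: str, intron_rna: str) -> str:
    """Toy splicing: remove the first occurrence of the marked intron."""
    return rna.replace(intron_rna, "", 1)


def reverse_transcribe(rna: str) -> str:
    """Toy reverse transcription: copy the RNA back into DNA letters."""
    return rna.replace("U", "T")


# Hypothetical marked element: exon 1 + intron + exon 2.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GGATAA"
marked_element = exon1 + intron + exon2

new_copy = reverse_transcribe(splice(transcribe(marked_element), transcribe(intron)))
print(new_copy)
```

The new copy lacks the intron, which is the structural signature observed in the transposed Ty-1 copies.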
If yeast Ty-1 moved by the mechanism illustrated for DNA-mediated replicative transposition in Figure 9.13, what would be predicted in the experiment just outlined? Also, would you expect an increase in transposition when transcription is induced?
Additional Consequences of Transposition
Not only can transposable elements interrupt genes or disrupt their regulation, but they can cause additional rearrangements in the genome. Homologous recombination can occur between any two nearly identical sequences. Thus when transposition makes a new copy of a transposable element, the two copies are now potential substrates for recombination. The outcome of recombination depends on the orientation of the two transposable elements relative to each other. Recombination between two transposable elements in the same orientation on the same chromosome leads to a deletion, whereas it results in an inversion if they are in opposite orientations (Figure 9.18).
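The dependence of the outcome on relative orientation can be made concrete with a small model in which a chromosome is a list of (name, strand) segments. This is a minimal sketch in Python; the segment names and the function `recombine` are hypothetical, and the excised circle produced by a deletion is not tracked.

```python
def recombine(chromosome, i, j):
    """Recombine the transposable element copies at indices i < j.

    Same orientation: the DNA between the copies, plus one copy, is deleted.
    Opposite orientation: the DNA between the copies is inverted.
    """
    strand_i = chromosome[i][1]
    strand_j = chromosome[j][1]
    if strand_i == strand_j:
        return chromosome[: i + 1] + chromosome[j + 1 :]
    flip = {"+": "-", "-": "+"}
    inverted = [(name, flip[strand]) for name, strand in reversed(chromosome[i + 1 : j])]
    return chromosome[: i + 1] + inverted + chromosome[j:]


direct = [("A", "+"), ("TE", "+"), ("B", "+"), ("C", "+"), ("TE", "+"), ("D", "+")]
opposite = [("A", "+"), ("TE", "+"), ("B", "+"), ("C", "+"), ("TE", "-"), ("D", "+")]

deletion_product = recombine(direct, 1, 4)
inversion_product = recombine(opposite, 1, 4)
print(deletion_product)
print(inversion_product)
```

In the first case segments B and C are lost along with one copy of the element; in the second they are retained but their order and strand are reversed.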
Figure 9.18. Possible outcomes of recombination between two transposable elements.
The preference of the retrotransposition machinery for LINE1 mRNA does not appear to be absolute. Many processed genes have been found in eukaryotic genomes; these are genes that have no introns. In many cases, a homologous gene with introns is seen in the genome, so it appears that these processed genes have lost their introns. It is likely that these were formed when processed mRNA derived from the homologous gene with introns was copied into cDNA and reinserted into the genome. Many, but not all, of these processed genes are pseudogenes, i.e. they have been mutated such that they no longer encode proteins. Other processed genes are active: they have inserted next to promoters and encode functional proteins.
Widespread roles of enhancer-like transposable elements in cell identity and long-range genomic interactions
A few families of transposable elements (TEs) have been shown to evolve into cis-regulatory elements (CREs). Here, to extend these studies to all classes of TEs in the human genome, we identified widespread enhancer-like repeats (ELRs) and find that ELRs reliably mark cell identities, are enriched for lineage-specific master transcription factor binding sites, and are mostly primate-specific. In particular, elements of MIR and L2 TE families whose abundance co-evolved across chordate genomes, are found as ELRs in most human cell types examined. MIR and L2 elements frequently share long-range intra-chromosomal interactions and binding of physically interacting transcription factors. We validated that eight L2 and nine MIR elements function as enhancers in reporter assays, and among 20 MIR-L2 pairings, one MIR repressed and one boosted the enhancer activity of L2 elements. Our results reveal a previously unappreciated co-evolution and interaction between two TE families in shaping regulatory networks.
© 2019 Cao et al. Published by Cold Spring Harbor Laboratory Press.
Enhancer- and promoter-like repeat elements (ELRs and PLRs) in human tissues and cell…
ELRs mark cell identities. ( A ) Correlations of each pair of tissues…
hESC- and iPSC-specific ELRs mark the master TF binding sites. ( A )…
Association between MIR and L2. ( A ) Heat map of Jaccard index…
Experimental validation of the enhancer activity and interaction between MIR and L2. (…
9.6: Classes of Transposable Elements - Biology
Results and discussion
TE content and distribution along the 21 bread wheat chromosomes
Building on a decade-long effort by the wheat genomics community, we used the accumulated knowledge about TEs to precisely delineate the TE repertoire of the 21 chromosomes, based on a similarity search with a high-quality TE databank: ClariTeRep, which includes TREP. This represents 3050 manually annotated and curated TEs carried by the three subgenomes and mainly identified on bacterial artificial chromosome (BAC) sequences obtained during map-based cloning or survey sequencing projects, especially on chromosome 3B. CLARITE was used to model TEs in the sequence and their nested insertions when possible. This led to the identification of 3,968,974 TE copies, belonging to 505 families, and representing 85% of RefSeq_v1.0. Overall, the TE proportion is very similar in the A, B, and D subgenomes, where TEs represent 86%, 85%, and 83% of the sequence, respectively. However, the sizes of the subgenomes differ: with 5.18 Gb, the B subgenome has the largest assembly size, followed by the A subgenome (4.93 Gb) and the smaller D subgenome (3.95 Gb). The repetitive fraction is mostly dominated by TEs of the class I Gypsy and Copia and class II CACTA superfamilies; other superfamilies contribute very little to overall genome size (Table 1, Fig. 1a).
TE composition of the three wheat subgenomes and examples of chromosomal distributions. a Stacked histograms representing the contribution of each TE superfamily to the three subgenomes. Un-annotated sequences are depicted in white and coding exons (accounting for only the representative transcript per gene) in orange. b Distribution of TE subfamilies along wheat chromosome 1A (as a representative of all chromosomes). The full datasets are shown in Additional file 1: Figures S1–S11. The TE distribution is shown in 30-Mb windows along chromosomes. TE abundance per 30-Mb window is shown as a heat-map and as a bar plot. The x-axis indicates the physical position in Mb, while the y-axis indicates the number of kb the TE family contributes to each 30 Mb. The total contribution in Mb of the respective TE family to the chromosome is depicted at the left.
At the superfamily level, the A, B, and D subgenomes have similar TE compositions (Fig. 1a). The smaller size of the D subgenome (~1 Gb smaller than A and B) is mainly due to a smaller amount of Gypsy (~800 Mb less; Fig. 1a). The A and B subgenomes differ in size by only 245 Mb (~5%), and nearly half of this (106 Mb) is not due to known TEs but rather to low copy sequences. Since the amount of coding DNA is very conserved (43, 46, and 44 Mb, respectively), this difference is mainly due to parts of the genome that remained un-annotated so far. This un-annotated portion of the genome may contain degenerated and unknown weakly repeated elements.
Similar to other complex genomes, only six highly abundant TE families represent more than half of the TE content: RLC_famc1 (Angela), DTC_famc2 (Jorge), RLG_famc2 (Sabrina), RLG_famc1 (Fatima), RLG_famc7 (Sumana/Sumaya), and RLG_famc5 (WHAM), while 486 families out of 505 (96%) each account for less than 1% of the TE fraction. In terms of copy number, 50% (253) of the families are repeated in fewer than 1000 copies at the whole genome level, while more than 100,000 copies were detected for each of the seven most repeated families (up to 420,639 Jorge copies).
Local variations of the TE density were observed following a pattern common to all chromosomes: the TE proportion is lower (on average 73%) in the distal regions than in the proximal and interstitial regions (on average 89%). However, much stronger local variations were observed when distributions of individual TE families were studied. Figure 1b shows TE distributions using chromosome 1A as a representative example. Distributions for selected TE families on all chromosomes are shown in Additional file 1: Figures S1–S11. The most abundant TE family, RLC_famc1 (Angela), was enriched towards telomeres and depleted in proximal regions. In contrast, highly abundant Gypsy retrotransposons RLG_famc2 (Sabrina, Fig. 1b) and RLG_famc5 (WHAM, not shown) were enriched in central parts of chromosome arms and less abundant in distal regions. CACTA TEs also showed a variety of distribution patterns. They can be grouped into distinct clades depending on their distribution pattern, as suggested earlier based on chromosome 3B TE analyses. Families of the Caspar clade are highly enriched in telomeric regions, as is shown for the example of DTC_famc1 (Caspar), whereas DTC_famc2 (Jorge) showed the opposite pattern (Fig. 1b).
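A windowed density profile of this kind can be computed directly from annotation coordinates. The Python sketch below is not the authors' pipeline; it is a minimal illustration that assumes non-overlapping, half-open, 0-based TE intervals (so nested insertions would need to be flattened first), and it uses a toy chromosome with 30-bp windows in place of 30-Mb windows. The function name and coordinates are hypothetical.

```python
def te_coverage_per_window(intervals, chrom_length, window):
    """Number of bases covered by TEs in each consecutive window.

    intervals: iterable of (start, end) pairs, 0-based and half-open,
    assumed not to overlap one another.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    covered = [0] * n_windows
    for start, end in intervals:
        pos = start
        while pos < end:
            w = pos // window
            # Stop at the window boundary or the interval end, whichever is first.
            segment_end = min((w + 1) * window, end)
            covered[w] += segment_end - pos
            pos = segment_end
    return covered


# Toy annotation on a 100-bp "chromosome" with 30-bp windows.
toy_intervals = [(0, 10), (25, 35), (90, 100)]
profile = te_coverage_per_window(toy_intervals, chrom_length=100, window=30)
print(profile)
```

Running the same function separately on the intervals of each TE family gives per-family profiles that can be compared along the chromosome.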
Centromeres have a specific TE content. Previous studies on barley and wheat reported that the Gypsy family RLG_famc8.3 (Cereba) is enriched in centromeres [22, 23]. It was speculated that Cereba integrase can target centromere-specific heterochromatin due to the presence of a chromodomain that binds specifically to centromeric histones. We found that wheat Cereba elements are concentrated in centromeric regions but absent from the rest of the genome (Fig. 1b, Additional file 1: Figure S8), as are their closely related subfamilies RLG_famc8.1 and RLG_famc8.2 (Quinta). We identified new TE families that are also highly enriched in centromeres. The family RLG_famc39 (Abia) is a relative of Cereba, although there is very little DNA sequence conservation between the two. However, at the protein level, Cereba is its closest homolog. Abia and Cereba have an extremely similar distribution (Fig. 1b, Additional file 1: Figures S8 and S9). Interestingly, on chromosome 6A Cereba is more abundant, while on 3B, Abia is more abundant, suggesting that the two TE families are competing for the centromeric niche. Abia seems to be a wheat-specific TE family, as it was not present in the recently published barley genome. A recent study on the barley genome reported a novel centromeric Gypsy family called Abiba. We identified a homolog in wheat: RLG_famc40 (Abiba), with two distinct subfamilies RLG_famc40.1 and RLG_famc40.2, corresponding to the putatively autonomous and non-autonomous variants. Abiba is enriched in central parts of chromosomes but with a broader spreading compared to Abia and Cereba (Additional file 1: Figures S10 and S11). At a higher resolution, we identified large tandem arrays of Cereba and Abia elements that correspond to the high k-mer frequencies observed at the centromeres (Fig. 2d), which might be the signature of functional centromeres (Additional file 1: Figure S12).
Variability and similarity of the repeat composition of the three wheat subgenomes. a Example of sequence alignment of three homeologous regions of ca. 300 kb on chromosomes 3A (from 683.185 to 683.435 Mb), 3B (from 723.440 to 723.790 Mb), and 3D (from 546.330 to 546.700 Mb). Genes are red boxes and TEs are blue boxes. Sequences sharing > 90% identity over more than 400 bp are represented by red (+/+ strand matches) and blue (+/− strand matches) areas. It shows the high conservation between homeologous genes and collinearity between A-B-D, and it shows the absence of TEs in syntenic positions while intergenic distances tend to be similar between homeologs. Similarities observed between TEs are not collinear and thus strongly suggest independent insertions, in the three subgenomes, of TEs from the same family instead of homeologous relationships. b Proportions of the 20 most abundant TE families comprising the hexaploid wheat genome depicted as fractions of A, B, and D subgenomes. For each family, the A-B-D fractions are represented in green, violet, and orange, respectively. 1 RLC_famc1 (Angela WIS) 2 DTC_famc2 (Jorge) 3 RLG_famc2 (Sabrina Derami Egug) 4 RLG_famc1 (Fatima) 5 RLG_famc7 (Erika Sumana Sumaya) 6 RLG_famc5 (WHAM Wilma Sakura) 7 RLG_famc3 (Laura) 8 RLG_famc4 (Nusif) 9 RLG_famc11 (Romana Romani) 10 RLG_famc10 (Carmilla Ifis) 11 RLC_famc3 (Claudia Maximus) 12 RLG_famc13 (Latidu) 13 RLG_famc6 (Wilma) 14 RLG_famc9 (Daniela Danae Olivia) 15 RLC_famc2 (Barbara) 16 DTC_famc1 (Caspar Clifford Donald Heyjude) 17 RLG_famc14 (Lila) 18 RLG_famc15 (Jeli) 19 RLG_famc8 (Cereba Quinta) 20 DTC_famc6 (TAT1). c k-mer-defined proportion of repeats of the subgenomes. Cumulative genome coverage of 20- and 60-mers at increasing frequencies. Around 40% of each subgenome assembly consists of 20-mers occurring ≥ 100 times. At the 60-mer level the D subgenome has the highest and B the lowest proportion of repeats. d Distribution of 20-mer frequencies across physical chromosomes. The B subgenome has the lowest overall proportion of repeats.
Similarity and variability of the TE content between the A, B, and D subgenomes
A genome-wide comparative analysis of the 107,891 high-confidence genes predicted along the A, B, and D subgenomes (35,345, 35,643, and 34,212, respectively) was described in detail elsewhere. It revealed that 74% of the genes are homeologs, with the vast majority being syntenic. Thus, gene-based comparisons of A-B-D highlighted a strong conservation and collinearity of the genes between the three genomes. However, outside the genes and their immediate surrounding regions, we found almost no sequence conservation in the TE portions of the intergenic regions (Fig. 2a). This is due to the “TE turnover”, which means that intergenic sequences (i.e., sequences that are not under selection pressure) evolve through rounds of TE insertions and deletions in a continuing process: DNA is produced by TE insertions into intergenic regions and removed by unequal crossovers or deletions that occur during double-strand break repair. Previous studies showed that this process occurs at a pace implying that intergenic sequences are completely turned over within a few million years [27, 28]. Consequently, we found practically no conserved TEs (i.e., TEs that were inserted in the common ancestor of the A, B, and D genome donors). Thus, although the repetitive fraction in A, B, and D genomes is mostly composed of the same TE families (see below), their individual insertion sites and nesting patterns are completely different.
Analysis of the k-mer content of RefSeq_v1.0 showed that 20-mers occurring 100× or more cover around 40% of the wheat genome sequence (Fig. 2c). For 60-mers, this value decreases to only 10%. This pattern was strongly similar between subgenomes, although a slight difference was observed: repeated k-mers covered a larger proportion of the subgenome D > A > B. This lower proportion of repeats in the B subgenome is also obvious using a heat-map of 20-mer frequencies (Fig. 2d), showing that the B genome contains a smaller proportion of high copy number perfect repeats.
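The k-mer coverage statistic used above can be illustrated on a toy sequence. The following is a minimal sketch, assuming that "coverage" means the fraction of bases lying inside at least one k-mer whose count reaches a threshold; the function name `repeat_coverage` is hypothetical, only the forward strand is counted, and the approach holds every k-mer in memory, so it is not suitable for a genome of wheat's size.

```python
from collections import Counter


def repeat_coverage(seq: str, k: int, min_count: int) -> float:
    """Fraction of bases covered by at least one k-mer seen >= min_count times.

    Toy, in-memory version: counts forward-strand k-mers only.
    """
    n = len(seq)
    if n < k:
        return 0.0
    # Count every overlapping k-mer in the sequence.
    counts = Counter(seq[i:i + k] for i in range(n - k + 1))
    covered = [False] * n
    for i in range(n - k + 1):
        if counts[seq[i:i + k]] >= min_count:
            # Mark all bases spanned by this frequent k-mer.
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / n


# A 20-bp tandem repeat followed by 10 bp of unique sequence.
toy = "ACGT" * 5 + "GATTACAGGC"
print(repeat_coverage(toy, k=4, min_count=5))
```

Raising `k` or `min_count` can only shrink the covered fraction, which is the same direction of change reported above when moving from 20-mers to 60-mers.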
We then compared the A, B, and D subgenomes at the TE family level. We did not find any TE families (accounting for > 10 kb) that are specific for a single subgenome or completely absent in one subgenome (only two cases of subgenome-specific tandem repeats were found: XXX_famc46/c47). More surprisingly, the abundance of most TE families is similar in the A, B, and D subgenomes. Indeed, among the 165 families which represent at least 1 Mb of DNA each, 125 (76%) are present in similar proportions in the three subgenomes, i.e., we found less than a twofold change of the proportion between subgenomes. Figure 2b represents the proportions of the 20 most abundant families in the three subgenomes, which account for 84% of the whole TE fraction. Their proportion is close to the relative sizes of the three subgenomes: 35%, 37%, 28% for A, B, D, respectively. This highlighted the fact that not only are the three subgenomes shaped by the same TE families, but also that these families are present in proportions that are conserved. Consistent with this, we identified only 11 TE families (7%) that show a strong difference (i.e., more than a threefold change in abundance) between two subgenomes, representing only 2% of the overall TE fraction.
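The twofold and threefold thresholds can be made concrete with a short calculation. This is a sketch under stated assumptions: a family's proportion is taken as its base pairs divided by the size of the subgenome, the fold change is the largest proportion divided by the smallest, and the names `max_fold_change` and `classify` as well as the numbers in the example are hypothetical.

```python
def max_fold_change(family_bp: dict, subgenome_bp: dict) -> float:
    """Largest ratio between the per-subgenome proportions of one TE family."""
    # Normalize by subgenome size so a smaller subgenome is not penalized.
    props = [family_bp[s] / subgenome_bp[s] for s in ("A", "B", "D")]
    lowest, highest = min(props), max(props)
    return float("inf") if lowest == 0 else highest / lowest


def classify(fold: float) -> str:
    """Apply the thresholds quoted in the text: < 2 similar, > 3 strong."""
    if fold < 2:
        return "similar"
    if fold > 3:
        return "strong difference"
    return "intermediate"


# Hypothetical family that is four times denser in D than in A or B.
sizes = {"A": 100, "B": 100, "D": 100}
family = {"A": 10, "B": 10, "D": 40}
print(classify(max_fold_change(family, sizes)))
```

A family whose base pairs split 35/37/28 across subgenomes of relative size 35/37/28 has a fold change of 1, which is why raw proportions close to the subgenome sizes indicate equal density.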
Thus, despite the near-complete TE turnover that has occurred independently in the A-B-D diploid lineages (Fig. 2a), and although TEs have transposed and proliferated very little since polyploidization (0.5 Mya, see below), the TE families that currently shape the three subgenomes are the same, and more strikingly, their abundance remained very similar. We conclude that almost all families ancestrally present in the A-B-D common ancestor have been active at some point and their amplification has compensated for their loss by deletion, thus suggesting a dynamic in which families are maintained at equilibrium in the genome for millions of years. This evolutionary scenario differs from the model where TEs evolve by massive bursts of a few families leading to rapid diversification. For example, Piegu et al. showed that an amplification burst of a single retrotransposon family led to a near doubling of the genome size in Oryza australiensis. In wheat, by contrast, many TE families contribute to the genome diversification, as suggested for plants with very large genomes (> 30 Gb).
Strong differences in abundance between the A, B, and D genomes were observed at the subfamily level (Fig. 3). For example, the highly abundant RLC_famc1 (Fatima) family has diverged into at least five subfamilies (1.1 to 1.5). Only RLC_famc1.1 contains potentially functional reverse transcriptase (RT) and integrase (INT) genes, while RLC_famc1.4 and RLC_famc1.5 contain gag and protease open reading frames (ORFs). RLC_famc1.2 and RLC_famc1.3 appear to be non-autonomous, as they do not contain any intact ORFs. We suggest that RLC_famc1.1 provides functional RT and INT proteins, while protease and GAG are provided by other subfamilies. Their contrasted abundance revealed that RLC_famc1.4 and RLC_famc1.5 proliferated specifically in the B and A lineages, respectively (Fig. 3a).
Distribution of different subfamilies in the A, B, and D subgenomes. a Distribution of RLC_famc1 (Fatima) retrotransposons. Group 6 chromosomes were chosen as representative for the whole genome. A phylogenetic tree of the different subfamilies is shown at the left. For the construction of the phylogenetic tree, the LTR sequences were used (internal domains between RLC_famc1.1 and the other subfamilies are completely different, as only RLC_famc1.1 contains reverse transcriptase and integrase genes). Bootstrap values (100 repetitions) are indicated. Sequence organization and gene content of the individual subfamilies are shown to the right of the tree. Chromosomal distributions are shown at the right in bins of 50 Mb as heat-maps and bar plots to indicate absolute numbers. The y-axis indicates the total number of kb that is occupied by the respective subfamily in each bin. The most recently diverged subfamilies RLC_famc1.4 and RLC_famc1.5 show strong differences in abundance in different subgenomes. b Examples of TE subfamilies that have strongly differing copy numbers in the A, B, and D subgenomes. Again, only a single group of homeologous chromosomes is shown (see Additional file 1: Figures S1–S3 for the other chromosomes). Abundance is shown in 30-Mb windows
In total, we identified 18 different subfamilies (belonging to 11 different families) which show subgenome-specific over- or under-representation (Table 2). Here, we only considered TE families that contribute more than 0.1% to the total genome and are at least threefold over- or under-represented in one of the subgenomes. This illustrated that these 11 highly abundant families did not show a bias between A-B-D at the family level, but are composed of several subfamilies that were differentially amplified in the three diploid lineages. The CACTA family DTC_famc10.3 (Pavel) is much more abundant in the D subgenome than in the A and B subgenomes (Additional file 1: Figure S1). Interestingly, the Pavel subfamily also seems to have evolved a preference for inserting close to centromeres in the D subgenome, while this tendency is not obvious in the A and B subgenomes (Fig. 3b). Generally, subfamilies were enriched in a single genome (Table 2). In only four cases, a subfamily was depleted in one subgenome while abundant at similar levels in the other two. Three of these cases were found in the D subgenome. This is consistent with the smaller D subgenome size, and differences in highly abundant elements contribute to this difference.
Dynamics of LTR retrotransposons from the diploid ancestors to the hexaploid
The largest portion of plant genomes with size over 1 Gb consists of LTR-RTs. Intact full-length elements represent recently inserted copies, whereas old elements have experienced truncations, nested insertions, and mutations that finally lead to degenerated sequences until they become unrecognizable. Full-length LTR-RTs (flLTR-RTs) are bordered by two LTRs that are identical at the time of insertion and subsequently diverge by random mutations, a characteristic that is used to determine the age of transposition events. In previous genome assemblies, terminal repeats tended to collapse, which resulted in very low numbers of correctly reconstructed flLTR-RTs (triangles in Additional file 1: Figure S13). We found 112,744 flLTR-RTs in RefSeq_v1.0 (Additional file 1: Table S1, Figure S13), which was in line with the expectations and confirmed the linear relationship between flLTR-RTs and genome size within the Poaceae. This is two times higher than the number of flLTR-RTs assembled in TGAC_v1, while almost no flLTR-RTs were assembled in the 2014 gene-centric draft assembly.
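The dating principle described above, two LTRs that start out identical and then diverge, reduces to a simple formula: age = d / (2r), where d is the divergence between the two LTRs and r is the substitution rate per site per year. The sketch below assumes a plain proportion of mismatches over aligned, ungapped columns and a rate of 1.3 × 10^−8, a value often used for grass intergenic DNA. Both the rate and the function names are assumptions for illustration and are not taken from the text.

```python
def p_distance(ltr5: str, ltr3: str) -> float:
    """Proportion of mismatches between two aligned LTRs, ignoring gap columns."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTR sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def insertion_age_years(divergence: float, rate: float = 1.3e-8) -> float:
    """Age = d / (2r); the factor 2 reflects that both LTRs accumulate mutations.

    The default rate is an assumed value, not one stated in the text.
    """
    return divergence / (2 * rate)


# Two toy LTRs differing at 1 of 10 aligned positions.
d = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(d, insertion_age_years(d))
```

With this assumed rate, a divergence of 2.6% corresponds to about one million years, and an element with two identical LTRs gets an age of zero.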
We exploited this unique dataset to gain insights into the evolutionary history of hexaploid wheat from a transposon perspective. flLTR-RTs are evenly distributed among the subgenomes, with on average 8 elements per Mb (Additional file 1: Table S1). Among them, there were two times more Copia (RLC) than Gypsy (RLG) elements, although Gypsy elements account for 2.8× more DNA. This means that the proportion of young intact elements is higher for the Copia superfamily than for the Gypsy superfamily. Indeed, the median insertion ages for Copia, Gypsy, and RLX (unclassified LTR-RTs) are 0.95, 1.30, and 1.66 million years (Myr). RLXs lack a protein domain, preventing a straightforward classification into Gypsy or Copia. The missing domains can most likely be accounted for by their older age and, thus, their higher degree of degeneration. RLX elements are probably unable to transpose on their own, but the occurrence of such very recently transposed elements suggests that they are non-autonomous, as described for the Fatima subfamilies (Fig. 3a). Between the A and B subgenomes, all flLTR-RT metrics are very similar, whereas the D subgenome stands out with younger insertions. In any case, age distributions of flLTR-RTs show that most of the identified full-length elements inserted after the divergence of the three subgenomes, thereby reflecting the genomic turnover that has removed practically all TEs that were present in the A-B-D ancestor (see above).
We analyzed the chromosomal distributions of the flLTR-RTs (Additional file 1: Figure S14). The whole set of elements is relatively evenly scattered along the chromosomes with high density spots in the distal gene-rich compartments. The most recent transpositions (i.e., copies with two identical LTRs) involved 457 elements: 257 Copia, 144 Gypsy, and 56 RLXs. They are homogeneously distributed along the chromosomes (Additional file 1: Figure S14B), confirming previous hypotheses stating that TEs insert at the same rate all along the chromosome but are deleted faster in the terminal regions, leading to gene-rich and TE-depleted chromosome extremities.
The current flLTR-RT content is the outcome of two opposing forces: insertion and removal. Therefore, we calculated a persistence rate, giving the number of elements per 10,000 years that have remained intact over time, for the 112,744 flLTR-RTs (Fig. 4a). It revealed broad peaks for each superfamily, with maxima ranging from 0.6 Mya (for Copia in the D subgenome) to 1.5 Mya (for RLX in the A and B subgenomes). The D subgenome contained on average younger flLTR-RTs compared to A and B, with a shift of activity by 0.5 Myr. Such peaks of age distributions are commonly interpreted in the literature as transposon amplification bursts. We find the “burst” analogy misleading, because the actual values are very low. For wheat, it represents a maximal rate of only 600 copies per 10,000 years. A more fitting analogy would be the formation of mountain ranges, where small net increases over very long time periods add up to very large systems. In the most recent time (< 10,000 years), after the hexaploidization event, we did not see any evidence in our data for the popular “genomic shock” hypothesis, postulating immediate drastic increases of transposon insertions [34, 35, 36]. For the A and B subgenomes, a shoulder in the persistence curves around 0.5 Mya (Fig. 4a), the time point of tetraploidization, was observed. We suggest that counter-selection of harmful TE insertions was relaxed in the tetraploid genome, i.e., the polyploid could tolerate insertions which otherwise would have been removed by selection in a diploid.
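A persistence rate of this kind is essentially a histogram of insertion ages with 10,000-year bins. The following sketch shows that reading under explicit assumptions: ages are given in years, no smoothing is applied, and the function name `persistence_rate` and the example ages are hypothetical.

```python
from collections import Counter


def persistence_rate(ages_years, bin_years: int = 10_000) -> dict:
    """Map each bin start (years ago) to the number of intact elements in it."""
    # Floor each age to the start of its bin, then count per bin.
    counts = Counter(int(age // bin_years) * bin_years for age in ages_years)
    return dict(sorted(counts.items()))


# Hypothetical insertion ages of six intact elements.
ages = [0, 5_000, 9_999, 10_000, 600_000, 604_000]
print(persistence_rate(ages))
```

Each count reflects elements that both inserted in that interval and survived until today, so a low count in old bins can result from removal as much as from low past activity.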
Insertion time frames of wheat LTR retrotransposons. a Persistence rate in number of elements per 10,000 years that have remained intact until now (meaning they have not been removed or truncated over time). The D subgenome has younger flLTR-RTs; the curves for all superfamilies are shifted by 0.5 Myr. The shoulder at 0.5 Myr in the A and B subgenomes could reflect a decrease in removal rates after the tetraploidization. b Comparison of different cluster stringencies. y-axis: subgenome specificity of the clusters, e.g., “ABD” has members from all three subgenomes, “AB” only from A and B; x-axis: log cluster size; the color coding gives the number of clusters; the circle area corresponds to the number of elements. The family clustering at 80% identity over 80% mutual coverage generates large clusters, but has a low proportion of subgenome-specific clusters. The 90/90 subfamily level cluster set with a high number of subgenome-specific clusters and three large ABD clusters was used for further analyses. c Lifespan of subfamilies containing only either A, B, or D members. The line thickness represents cluster size. Lineages unique to the A or B subgenome occur only down to 0.5 Myr, confirming the estimated time point for the tetraploidization. However, D subgenome-unique lineages kept on proliferating, a clear sign for a very recent hexaploidization.
To elucidate the TE amplification patterns that have occurred before and after polyploidization, we clustered the 112,744 flLTR-RTs based on their sequence identity. The family level was previously defined at 80% identity over 80% sequence coverage (80/80 clusters) . We also clustered the flLTR-RTs using a more stringent cutoff of 90/90 and 95/95 to enable classification at the subfamily level (Fig. 4b). The 80/80 clusters were large and contained members of all three subgenomes. In contrast, the 90/90 and 95/95 clusters were smaller, and a higher proportion of them are specific to one subgenome. To trace the polyploidization events, we defined lifespans for each individual LTR-RT subfamily as the interval between the oldest and youngest insertion (Fig. 4c). Subfamilies specific to either the A or B subgenome amplified until about 0.4 Myr, which is consistent with the estimated time of the tetraploidization. Some of the D subgenome-specific subfamilies inserted more recently, again consistent with the very recent hexaploidization.
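The lifespan and subgenome specificity of a subfamily cluster follow directly from its member list. This sketch assumes each cluster is available as a list of (subgenome, insertion age in Myr) pairs; the function name `subfamily_summary` and the example values are hypothetical, and the clustering step itself is not reproduced here.

```python
def subfamily_summary(members):
    """Summarize one cluster given (subgenome, insertion_age_myr) pairs."""
    ages = [age for _, age in members]
    # "A", "AB", "ABD", ...: which subgenomes contribute members.
    specificity = "".join(sorted({sg for sg, _ in members}))
    return {
        "specificity": specificity,
        "youngest": min(ages),  # most recent insertion
        "oldest": max(ages),    # earliest surviving insertion
        "size": len(members),
    }


# Hypothetical A-specific subfamily that stopped amplifying about 0.6 Myr ago.
cluster = [("A", 1.8), ("A", 0.6), ("A", 1.1)]
print(subfamily_summary(cluster))
```

Under this reading, a subgenome-specific cluster whose youngest member predates a polyploidization event is consistent with amplification in the diploid ancestor only.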
These results confirmed that the three subgenomes were shaped by common families present in the A-B-D common ancestor that have amplified independently in the diploid lineages. They evolved to give birth to different subfamilies that, generally, did not massively amplify after polyploidization and, thus, are specific to one subgenome. To confirm this hypothesis, we explored the phylogenetic trees of the three largest 90/90 clusters color-coded by subgenome (Fig. 5 and Additional file 1: Figures S15–S17 for more details). The trees show older subgenome-specific TE lineages which have proliferated in the diploid ancestors (2–0.5 Mya). However, the youngest elements (< 0.5 Mya) were found in clades interweaving elements of the A and B subgenomes, corresponding to amplifications in the tetraploid. Such cases involving the D subgenome were not observed, showing that flLTR-RTs from D have not yet transposed in large amounts across the subgenomes since the birth of hexaploid wheat 8000–10,000 years ago. We further noticed several instances in the trees where D lineages were derived from older B or A lineages, but not the reverse. This may be explained by the origin of the D subgenome through homoploid hybridization between A and B.
LTR retrotransposon footprints in the evolution of hexaploid wheat. a Evolution of the wheat genome with alternative scenarios and timescales. The dotted rectangles and * time values represent the scenario of A and B giving rise to the D subgenome by homoploid hybridization . The left timescale is based on another estimate based on the chloroplast genome evolution . The dotted horizontal arrows represent the unidirectional horizontal transposon transfers observed in this study. b Phylogenetic tree of the largest 90/90 cluster (6639 copies). c Top2 cluster (5387 copies), d Top3 cluster (4564 copies). The leaves of the tree are colored by the subgenome localization of the respective elements. The majority of the amplifications took place in the diploid ancestors evidenced by the single colored propagation lineages. Each tree contains one or several younger regions with interweaving A and B insertions (marked by ABAB). These younger proliferations only started in the AABB tetraploid, where the new elements inserted likewise into both subgenomes. The joining of the D genome was too recent to have left similar traces yet. The gray asterisks mark D lineages that stem from a B or A lineage
There are two proposed models of propagation of TEs: the “master copy” model and the “transposon” model. The “master copy” model gives rise to highly unbalanced trees (i.e., with long successive row patterns) where one active copy is serially replaced by another, whereas the “transposon” model produces balanced trees where all branches duplicate with the same rate. To better discern the tree topologies, we plotted trees with equal branch length and revealed that the three largest trees (comprising 15% of flLTR-RTs) are highly unbalanced (Additional file 1: Figure S18), while the smaller trees are either balanced or unbalanced (Additional file 1: Figure S19). Taken together, both types of tree topologies exist in the proliferation of flLTR-RTs, but there is a bias towards unbalanced trees for younger elements, suggesting that TE proliferation followed the “master copy” model.
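Tree balance was judged visually here, but it can also be quantified. As an illustration only, the sketch below computes the Colless imbalance index, a standard measure that is not mentioned in the text and is swapped in for the visual assessment: for every internal node, it adds the absolute difference between the numbers of leaves on its two sides. Trees are assumed to be strictly binary and encoded as nested 2-tuples, and the function name `colless` is hypothetical.

```python
def colless(tree):
    """Return (number_of_leaves, Colless index) for a nested 2-tuple tree.

    Any non-tuple value is treated as a leaf.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, i_left = colless(left)
    n_right, i_right = colless(right)
    # Imbalance at this node is the difference in leaf counts of its children.
    return n_left + n_right, i_left + i_right + abs(n_left - n_right)


balanced = (("a", "b"), ("c", "d"))      # "transposon"-like topology
caterpillar = ((("a", "b"), "c"), "d")   # "master copy"-like topology
print(colless(balanced), colless(caterpillar))
```

A fully ladder-like tree with n leaves reaches the maximum value (n − 1)(n − 2)/2, whereas a perfectly symmetric tree scores 0. The recursive form is fine for small examples but would need an iterative rewrite for trees with thousands of leaves.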
In summary, our findings give a timed TE atlas depicting detailed TE proliferation patterns of hexaploid wheat. They also show that polyploidization did not trigger bursts of TE activity. This dataset of well-defined transposon lineages now provides the basis to further explore the factors controlling transposon dynamics. Founder elements may help us obtain better insights into common patterns which could explain how and why amplification starts.
A stable genome structure despite the near-complete TE turnover in the intergenic sequences
As described above, intergenic sequences show almost no conservation between homeologous loci. That means they contain practically no TEs that have inserted already in the common ancestor of the subgenomes. Instead, ancestral sequences were removed over time and replaced by TEs that have inserted more recently. Despite this near-complete turnover of the TE space (Fig. 2a), the gene order along the homeologous chromosomes is well conserved between the subgenomes and is even conserved with the related grass genomes (sharing a common ancestor 60 Mya). Most interestingly and strikingly, not only gene order but also distances between neighboring homeologs tend to be conserved between subgenomes (Fig. 6). Indeed, we found that the ratio of distances between neighboring homeologs has a strong peak at 1 (or 0 in log scale on Fig. 6), meaning that distances separating genes tend to be conserved between the three subgenomes despite the TE turnover. This effect is non-random, as ratio distribution curves are significantly flatter (p = 1 × 10⁻⁵) when gene positions along chromosomes are randomized. These findings suggest that distances between genes are likely under selection pressure.
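The distance ratio behind this analysis is easy to state in code. This is a minimal sketch, assuming that each homeolog triplet comes with the distance (in bp) from each gene to its nearest neighboring gene and that a base-10 logarithm is used; the base, the function name `log_distance_ratios`, and the example distances are assumptions.

```python
import math
from itertools import combinations


def log_distance_ratios(distances: dict) -> dict:
    """Pairwise log10 ratios of neighbor distances for one homeolog triplet."""
    ratios = {}
    for s1, s2 in combinations(sorted(distances), 2):
        # A value near 0 means the two subgenomes have similar gene spacing.
        ratios[f"{s1}/{s2}"] = math.log10(distances[s1] / distances[s2])
    return ratios


# Hypothetical triplet: spacing conserved between A and B, tenfold shorter in D.
print(log_distance_ratios({"A": 10_000, "B": 10_000, "D": 1_000}))
```

Pooling these values over all triplets gives the distribution whose peak at 0 is compared against the randomized gene positions.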
Comparison of distances between neighboring homeologs in the subgenomes. a Distances between genes and their closest neighbors were compared to those of their homeologous partners from the other subgenomes. For each homeolog triplet, three ratios were calculated (i.e., pairwise comparisons between the three subgenome homeologs). If the distance is similar in two subgenomes, the ratio will be close to 1. b Comparison of 2275 gene pairs from the terminal 150 Mb of short chromosome arms from A and B genomes. The distribution is compared to one where gene positions were randomized (see Methods). The observed data has a sharper peak at 1 (logarithmic scale where log(1) = 0). This indicates that distances between homeologs are conserved, despite the near-complete absence of conservation of intergenic sequences between subgenomes. c Analogous comparison of homeolog pairs from the A and D subgenomes. d Analogous comparison of homeolog pairs from the B and D subgenomes
We found this constrained distribution irrespective of the chromosome compartments, i.e., distal, interstitial, and proximal, exhibiting contrasted features at the structural (gene density) and functional (recombination rate, gene expression breadth) levels [25, 26]. However, constraints applied on intergenic distances seem relaxed (broader peak in Fig. 6) in proximal regions where the meiotic recombination rate is extremely low. At this point, we can only speculate about the possible impact of meiotic recombination as a driving force towards maintaining a stable chromosome organization. Previous studies have shown that recombination in highly repetitive genomes occurs mainly in or near genes . We hypothesize that spacing of genes is preserved for proper expression regulation or proper pairing during meiosis. Previous studies on introgressions of divergent haplotypes in large-genome grasses support this hypothesis. For instance, highly divergent haplotypes which still preserve the spacing of genes have been maintained in wheats of different ploidy levels at the wheat Lr10 locus .
Enrichment of TE families in gene promoters is conserved between the A, B, and D subgenomes
The sequences flanking genes have a very distinct TE composition compared to the overall TE space. Indeed, while intergenic regions are dominated by large TEs such as LTR-RTs and CACTAs, sequences surrounding genes are enriched in small TEs that are usually just a few hundred base pairs in size (Fig. 7). Immediately upstream and downstream of genes (within 2 kb), we identified mostly small non-autonomous DNA transposons of the Harbinger and Mariner superfamilies, referred to as Tourist and Stowaway miniature inverted-repeat transposable elements (MITEs), respectively , SINEs, and Mutators (Fig. 7). At the superfamily level, the A, B, and D subgenomes exhibit the same biased composition in gene surrounding regions (Additional file 1: Figure S20). We then computed, independently for each subgenome, the enrichment ratio of each TE family that was present in the promoter of protein-coding genes (2 kb upstream of the transcription start site (TSS)) compared to their overall proportion (in copy number, considering the 315 TE families with at least 500 copies). The majority (242, 77%) showed a bias (i.e., at least a twofold difference in abundance) in gene promoters compared to their subgenome average, confirming that the direct physical environment of genes contrasts with the rest of the intergenic space. Considering a strong bias, i.e., at least a threefold over- or under-representation in promoters, we found 105 (33%) and 38 (12%) families, respectively, that met this threshold in at least one subgenome. While it was previously known that MITEs were enriched in promoters of genes, here we show that this bias is not restricted to MITEs but rather involves many other families. Again, although TEs that shaped the direct gene environment have inserted independently in the A, B, and D diploid lineages, their evolution converged to three subgenomes showing very similar TE composition. 
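The enrichment ratio described above compares a family's share of TE copies in promoters with its share in the whole subgenome. The sketch below expresses it as a log2 ratio; the function name `log2_enrichment` and the copy numbers are hypothetical, and families with zero copies in either set are not handled.

```python
import math


def log2_enrichment(fam_promoter: int, all_promoter: int,
                    fam_genome: int, all_genome: int) -> float:
    """log2 of (family share among promoter TEs) / (family share genome-wide)."""
    promoter_share = fam_promoter / all_promoter
    genome_share = fam_genome / all_genome
    return math.log2(promoter_share / genome_share)


# Hypothetical family: 4% of promoter TE copies but 1% of all TE copies.
print(log2_enrichment(40, 1_000, 100, 10_000))
```

On this scale a twofold bias corresponds to an absolute value of at least 1, and a threefold bias to at least log2(3), about 1.58; positive values mean over-representation in promoters.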
To go further, we showed that the tendency of TE families to be enriched in, or excluded from, promoters was extremely conserved between the A, B, and D subgenomes (Fig. 8), although TEs are not conserved between homeologous promoters (inserted after A-B-D divergence), except for a few cases of retained TEs (see below). In other words, when a family is over- or under-represented in the promoter regions of one subgenome, it is also true for the two other subgenomes. We did not find any family that was enriched in a gene promoter in one subgenome while under-represented in gene promoters of another subgenome.
TE landscape surrounding genes. Genes from the three subgenomes were treated separately. For all genes, the 10 kb upstream of the transcription start site (TSS) and 10 kb downstream of the transcription end site were analyzed. Abundance of the different TE families was compiled for all genes of each subgenome. The plots include only those superfamilies that are specifically enriched near genes and which are otherwise less abundant in intergenic sequences
Enrichment analyses of TE families within gene promoters. The y-axis represents the log2 ratio of the proportion (i.e., percentage in terms of number of copies) of each TE family observed in the promoter of genes (2 kb upstream the TSS) relative to their proportion at the whole subgenome level. Positive and negative values represent an over- and under-representation of a given family in the promoters, respectively. Log2 ratios were calculated for the three subgenomes independently (A green B violet D orange) and the three values were represented here as a stacked histogram. Only highly repeated families (500 copies or more) are represented, with 1 panel per superfamily. Families are ordered decreasingly along the x-axis according to the whole genome log2 ratio
Superfamily is generally but not always a good indicator of the enrichment of TEs in genic regions (Fig. 8). For instance, 83% (25/30) of the LINE families are over-represented in the promoter regions, while none of them is under-represented (considering a twofold change). We confirmed that class 2 DNA transposons (especially MITEs) are enriched in promoters, while Gypsy retrotransposons tend to be excluded from the close vicinity of genes. Indeed, among the 105 families strongly enriched in promoters (threefold change), 53% (56) are from class 2 and 21% (22) are LINEs, and only 5% (5) are LTR-RTs. Contrary to Gypsy, Mutator, Mariner, and Harbinger, families belonging to CACTA and Copia superfamilies do not share a common enrichment pattern: some TE families can be either over- or under-represented in promoters (Fig. 8). This confirmed previous results about CACTAs annotated along the 3B chromosome , revealing that a part of the CACTA families is associated with genes while the other follows the distribution of Gypsy. Our results showed that this is also true for Copia.
Thus, the TE turnover did not change the highly organized genome structure. Given that not only proportions, but also enrichment patterns, remained similar for almost all TE families after A-B-D divergence, we suggest that TEs tend to be at equilibrium in the genome, with amplification compensating for their deletion (as described previously), and with families enriched around genes having remained the same.
No strong association between gene expression and particular TE families in promoters
We investigated the influence of neighboring TEs on gene expression. Indeed, TEs are so abundant in the wheat genome that genes are almost systematically flanked by a TE in the direct vicinity. The median distance between the gene TSS and the closest upstream TE is 1.52 kb, and the median distance between the transcription termination site (TTS) and the closest downstream TE is 1.55 kb, while the average gene length (between TSS and TTS) is 3.44 kb. The density as well as the diversity of TEs in the vicinity of genes allow us to speculate on potential relationships between TEs and gene expression regulation. We used a previously built gene expression network based on an exhaustive set of wheat RNA-seq data. Genes were clustered into 39 expression modules sharing a common expression profile across all samples. We also grouped unexpressed genes to study the potential influence of TEs on neighbor gene silencing. For each gene, the closest TE upstream was retrieved, and we investigated potential correlations through an enrichment analysis (each module was compared to the full gene set). Despite the close association between genes and TEs, no strong enrichment for a specific family was observed for any module or for the unexpressed genes.
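The text does not state which statistical test was used for the module-versus-full-set comparison. One common choice for this kind of question is a one-sided hypergeometric test, sketched below as an assumption rather than as the method of the study; the function name `hypergeom_upper_tail` and the example counts are hypothetical, and correction for testing many families and modules is left out.

```python
from math import comb


def hypergeom_upper_tail(k: int, total: int, with_family: int, module: int) -> float:
    """P(X >= k) when `module` genes are drawn without replacement from `total`.

    `with_family` genes in the full set have the TE family of interest as
    their closest upstream TE; k is the number of such genes in the module.
    """
    upper = min(with_family, module)
    favorable = sum(
        comb(with_family, i) * comb(total - with_family, module - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(total, module)


# Hypothetical module of 2 genes from a set of 10, where 5 carry the family.
print(hypergeom_upper_tail(2, total=10, with_family=5, module=2))
```

A small tail probability would indicate that the family flanks more genes in the module than expected from its frequency in the full gene set.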
We then studied the TE landscape upstream of wheat homeolog triplets, focusing on 19,393 triplets (58,179 genes) with a 1:1:1 orthologous relationship between A, B, and D subgenomes. For each triplet, we retrieved the closest TE flanking the TSS and investigated the level of conservation of flanking TEs between homeologs. For 75% of the triplets, the three flanking TEs belong to three different families, revealing that, even in the close vicinity of genes, TEs are in majority not conserved between homeologs due to rapid turnover. This suggests that most TEs present upstream of triplets were not selected for by the presence of common regulatory elements across homeologs. However, for 736 triplets (4%), the three homeologs are flanked by the same element, constituting a conserved noncoding sequence (CNS), suggesting that part of this element is involved in the regulation of gene expression. These TE-derived CNSs are on average 459 bp, which is three times smaller than the average size of gene-flanking TE fragments (on average 1355 bp), suggesting that only a portion of the ancestrally inserted TEs are under selection pressure. They represent a wide range (149 different families) of diverse elements belonging to all the different superfamilies.
The majority of homeolog triplets have relatively similar expression patterns [26, 44], contrary to what was found for older polyploid species like maize. In synthetic polyploid wheat, it was shown that repression of D subgenome homeologs was related to silencing of neighbor TEs. Thus, we focused on triplets for which two copies are coexpressed while the third is silenced. However, enrichment analysis did not reveal any significant enrichment of specific TE families in promoters of the silenced homeologs. We also examined transcriptionally dynamic triplets across tissues. Again, no TE enrichment in promoters was observed. These results suggest that recent changes in gene expression are not due to specific families recently inserted in the close vicinity of genes.
Transposable elements play an important role in genetic expression and evolution
Until recently, little was known about how transposable elements contribute to gene regulation. Transposable elements are small pieces of DNA that can replicate themselves and spread through the genome. Although they make up nearly half of the human genome, they were often ignored and commonly thought of as “useless junk,” with a minimal role, if any at all, in the activity of a cell. A new study by Adam Diehl, Ningxin Ouyang, and Alan Boyle, University of Michigan Medical School and members of the U-M Center for RNA Biomedicine, shows that transposable elements play an important role in regulating genetic expression, with implications for the understanding of genetic evolution.
Chromatin loops are important for gene regulation because they define a gene’s regulatory neighborhood, which contains the promoter and enhancer sequences responsible for determining its expression level. Remarkably, transposable elements (TEs) are responsible for creating around 1/3 of all loop boundaries in the human and mouse genomes, and contribute up to 75% of loops unique to either species. When a TE creates a human-specific or mouse-specific loop it can change a gene’s regulatory neighborhood, leading to altered gene expression. The illustration shows a hypothetical region of the human and mouse genomes in which four enhancer sequences for the same target gene fall within a conserved loop. In this example, a TE-derived loop boundary in the human genome (orange bar) shrinks the regulatory neighborhood, preventing two of four enhancers from interacting with their target gene’s promoter sequence. The net result is reduced gene expression in human relative to mouse. Looping variations such as these appear to be an important underlying cause of differential gene regulation across species and between different human cell types, suggesting that TE activity may play significant roles in evolution and disease.
Transposable elements move around the genome and, contrary to what was previously thought, the authors of this paper found that when they insert at new sites, transposable elements sometimes change the way DNA strands interact in 3D space, and therefore the structure of the 3D genome. It appears that about a third of the 3D contacts in the genome originate from transposable elements, leading to an outsized contribution by these regions to looping variation and demonstrating their significant role in genetic expression and evolution.
The main component that determines 3D structure is a protein called CTCF. This study particularly focused on how transposable elements create new CTCF sites that, in turn, hijack existing genomic structure to form new 3D contacts in the genome. The authors show that these often create variable loops that can influence regulatory activity and gene expression in the cell. These findings were observed in human cells and mouse cells and show how transposable elements contribute to intraspecies variation and interspecies divergence, and will guide further research efforts in areas including gene regulation, regulatory evolution, looping divergence, and transposable element biology.
To streamline this work, the authors developed a piece of software, MapGL, to track the physical gain and loss of short genetic sequences across species. For example, a sequence that existed in the most recent common ancestor may have been lost in one lineage or, conversely, could have been absent in the common ancestor but later gained in the human genome. MapGL enables predictions about the evolutionary influences of structural variations between species and makes this type of analysis much more accessible. For this paper, the input was a set of CTCF binding sites, which MapGL labeled to show that a sequence gain/loss process explains many of the differences in CTCF binding between humans and mice.
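The gain/loss reasoning described above can be illustrated with a toy labeling rule. This is not MapGL's implementation or interface; it is a minimal parsimony-style sketch that assumes each site is known to be present in the target species (for example human), and that presence or absence has already been determined in the query species (for example mouse) and in one or more outgroup species. The function name and labels are hypothetical.

```python
def label_site(in_query, outgroup_presence):
    """Label a sequence that is present in the target species.

    in_query: True if an orthologous sequence exists in the query species.
    outgroup_presence: list of booleans, one per outgroup species.
    """
    if in_query:
        return "ortholog"
    if not outgroup_presence:
        # Without an outgroup, gain and loss cannot be told apart.
        return "ambiguous"
    if any(outgroup_presence):
        # Present in an outgroup: most simply explained by an ancestral
        # sequence that was later lost in the query lineage.
        return "loss_in_query"
    # Absent from the query and every outgroup: most simply explained
    # by a gain in the target lineage.
    return "gain_in_target"

print(label_site(False, [False, False]))
```

A real analysis would weigh the outgroups according to the species tree rather than treating them equally, which this sketch does not attempt.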
With a background in computer science and molecular biology, Alan Boyle explains that he has always been interested in gene regulation. “It’s like a complex circuit: perturbing gene regulation through changes to the three-dimensional structure of the genome can have very different and wide-ranging outcomes.”
For Adam Diehl, this research continues the great discoveries that started in the late 1800s, when scientists first looked at the shape of chromosomes through microscopes. They observed the shape differences between cells, and noticed that the shape inside the nuclei remained the same between mother and daughter cells. Decades later, transposable elements were discovered at his alma mater, Cornell University: jumping genes could change the phenotypes of corn plants. In the 1970s, because the genes of humans and chimpanzees are much too similar to explain the differences between the species, scientific focus shifted to how the genes are being used. For Diehl, “It’s so exciting to be able to synthesize all this knowledge, and contribute to the next step of the story of species evolution.”
This research team will further study the impact of transposable elements on the 3D genome, but this time with a particular interest in a single human population sample rather than across species. The next steps will include experimental follow-up using a new sequencing method capable of identifying transposable element insertions that are variable across human populations. This method was developed in collaboration with Ryan Mills’s lab at the University of Michigan Medical School. It is expected that the next results will further the understanding of the regulatory role of transposable elements, with possible applications to neurodegenerative diseases.
Diehl, A.G., Ouyang, N. & Boyle, A.P. Transposable elements contribute to cell and species-specific chromatin looping and gene regulation in mammalian genomes. Nat Commun 11, 1796 (2020). https://doi.org/10.1038/s41467-020-15520-5
About the Authors
Adam Diehl is a Research Computer Specialist in the Alan Boyle Lab, Department of Computational Medicine and Bioinformatics, University of Michigan
Alan P. Boyle, Ph.D. is Assistant Professor, Department of Computational Medicine and Bioinformatics (DCM&B), Department of Human Genetics, University of Michigan Medical School.
Ningxin Ouyang is a Doctoral Student, Bioinformatics Candidate in the Alan Boyle Lab
These authors contributed equally: Hyo Sik Jang and Nakul M. Shah.
Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
Hyo Sik Jang, Nakul M. Shah, Alan Y. Du, Zea Z. Dailey, Erica C. Pehrsson, Paula M. Godoy, David Zhang, Daofeng Li, Xiaoyun Xing, Sungsu Kim, David O’Donnell, Jeffrey I. Gordon & Ting Wang
The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St Louis, MO, USA
Hyo Sik Jang, Nakul M. Shah, Alan Y. Du, Zea Z. Dailey, Erica C. Pehrsson, Paula M. Godoy, David Zhang, Daofeng Li, Xiaoyun Xing, David O’Donnell, Jeffrey I. Gordon & Ting Wang
Hope Center for Neurological Disease, Washington University School of Medicine, St Louis, MO, USA
Center for Gut Microbiome and Nutrition Research, Washington University School of Medicine, St Louis, MO, USA
David O’Donnell & Jeffrey I. Gordon
H.S.J., N.M.S. and T.W. conceived and implemented the study. N.M.S., H.S.J., E.C.P., D.L. and T.W. contributed to the computational analysis. H.S.J. generated transcriptomic and epigenomic profiles of cell lines. H.S.J., X.X. and D.Z. performed the CRISPR-mediated deletion experiments. H.S.J. and Z.Z.D. performed the promoter-luciferase, motif mutagenesis and let-7 qPCR experiments. H.S.J. and A.Y.D. performed the growth and migration assays. H.S.J., A.Y.D., D.O. and J.I.G. performed the xenograft experiments. H.S.J., P.M.G. and S.K. performed the targeted methylation experiments. H.S.J. performed the rescue experiments. The manuscript was prepared and revised by H.S.J., N.M.S. and T.W. with input from all authors.
Interspecific hybridization between D. buzzatii and D. koepferae causes different changes in TE expression: Some TE families are more expressed in hybrids, whereas others are more expressed in parental species. TE overexpression in hybrids might be caused not only by a failure of TE regulation mechanisms but also by an increase in TE copy number. In our study, these two events cannot be distinguished, but they are considered to be linked to each other because transcription precedes transposition events (especially of retrotransposons). On the other hand, TE families that are underexpressed in hybrids might present more efficient repression mechanisms or simply a lower copy number in hybrids.
In ovaries, hybrid TE overexpression prevails over underexpression (tables 1 and 2 and supplementary table S2, Supplementary Material online). This concurs with several studies focused on a single or few TEs, where higher transcription levels in hybrids than in parents were observed (Kawakami et al. 2011; Carnelossi et al. 2014; García Guerreiro 2015). At a whole-genome level, a few surveys also report cases of TE families underexpressed in hybrids, but these results generally receive little attention and are consequently poorly discussed. For instance, in lake whitefish hybrids, approximately 38% of differentially expressed TEs are underexpressed (Dion-Côté et al. 2014). Another well-studied case is that of hybrid sunflowers, where F1 hybrids present lower expression of the majority of TEs compared with parental species (Renaut et al. 2014). The presence of both overexpressed and underexpressed TEs suggests that hybrid TE deregulation is more complex than previously expected and may depend on the TE family.
Functional Divergence between Parental piRNA Pathways Can Lead to Hybrid Incompatibilities
We demonstrate that TE families with differences higher than 2-fold in their piRNA amounts between D. buzzatii and D. koepferae are not more commonly deregulated than families with similar levels (fig. 4). This shows that the maternal cytotype failure hypothesis cannot completely account for the observed pattern of TE deregulation, which is consistent with the similarity of TE landscapes between our parental species (supplementary fig. S1, Supplementary Material online). Thus, this explanation might be valid only for some particular TE families (fig. 4).
On the other hand, sequence divergence between maternal piRNAs and paternal TE transcripts (and the reciprocal) could also lead to a decrease of silencing efficacy in hybrids. A genome-wide comparison of sequences within a TE family between parental species cannot be performed because sequenced TEs in D. koepferae are scarce and its genome has not been sequenced yet (see supplementary text S1, Supplementary Material online, for a discussion on this putative bias). However, the presence of underexpressed TEs in hybrids, together with the knowledge that some TE families (such as Helena) are highly conserved between our parental species (Romero-Soriano and García Guerreiro 2016), seems to rule out this explanation.
Therefore, our results point rather to the piRNA pathway global failure hypothesis, which states that accumulated divergence of piRNA pathway effector proteins is responsible for hybrid TE deregulation. In this way, we show that proteins involved in piRNA biogenesis and function are more divergent than expected between D. buzzatii and D. koepferae (fig. 6). Consistent with this observation, previous studies in other Drosophila species have demonstrated that some of these proteins are encoded by rapidly evolving genes with marks of adaptive selection (Obbard et al. 2009; Simkin et al. 2013). Furthermore, we find that almost all piRNA pathway genes present significant differences in expression between D. buzzatii and D. koepferae (table 3). Such a level of variability was also observed between different populations of the same species, D. simulans (Fablet et al. 2014).
Drosophila koepferae seems to produce higher amounts of piRNAs compared with D. buzzatii, which exhibits higher levels of ping-pong signature (fig. 5). Those differences in global piRNA production strategies between parental species could be linked to the divergence and variability in expression between piRNA pathway genes. Indeed, the two main effectors of ping-pong amplification, Aub and Ago3, are more expressed in D. buzzatii than in D. koepferae (log2FC = 2.62 and 0.80, table 3), which is consistent with the higher ping-pong fraction detected in this species. Furthermore, an excess of Aub expression relative to Piwi could lead to a decrease of piRNA production due to a less efficient phased piRNA biogenesis. After the cleavage of a piRNA cluster transcript by Ago3 in the ping-pong cycle, the remnants of this transcript are loaded into Aub and processed to form the 3′-end of an antisense Aub-bound piRNA (Czech and Hannon 2016). The excised fragment of the piRNA cluster transcript is then loaded into Piwi (and to a lesser extent, into Aub) and cut by Zucchini (Zuc) every 27–29 nucleotides, producing phased antisense piRNAs that allow sequence diversification (Han et al. 2015; Mohn et al. 2015). We can hypothesize that an excess of Aub expression leads to a more frequent loading of this protein for phased piRNA production, impairing the efficiency of phasing in D. buzzatii. This would lead to lower levels of piRNAs in D. buzzatii, which would mostly be produced by ping-pong amplification.
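The ping-pong signature mentioned above is commonly quantified as the share of sense/antisense read pairs whose 5′ ends overlap by exactly 10 nucleotides. The source does not describe how the authors computed it, so the following is only a minimal sketch under that common definition. It assumes reads have already been mapped, that every read is at least `max_overlap` nucleotides long, and that coordinates are given on the plus-strand axis; all names and positions are hypothetical.

```python
from collections import Counter

def overlap_histogram(plus_5p, minus_5p, max_overlap=23):
    """Count 5'-to-5' overlaps between plus- and minus-strand reads.

    plus_5p:  5' coordinates of plus-strand reads (their leftmost base).
    minus_5p: 5' coordinates of minus-strand reads (their rightmost base
              when expressed on the plus-strand axis).
    """
    minus_counts = Counter(minus_5p)
    hist = Counter()
    for p, n_plus in Counter(plus_5p).items():
        for overlap in range(1, max_overlap + 1):
            q = p + overlap - 1  # minus-strand 5' end giving this overlap
            if q in minus_counts:
                hist[overlap] += n_plus * minus_counts[q]
    return hist

def ping_pong_fraction(hist, signature=10):
    """Fraction of overlapping pairs that overlap by `signature` nt."""
    total = sum(hist.values())
    return hist[signature] / total if total else 0.0

hist = overlap_histogram(plus_5p=[100, 100, 200], minus_5p=[109, 205])
print(ping_pong_fraction(hist))
```

In this toy input, two read pairs overlap by 10 nt and one by 6 nt, so two thirds of the overlapping pairs carry the signature.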
Contrary to Aub, qin is more expressed in D. koepferae than in D. buzzatii (log2FC = −1.30, table 3), which can be at the origin of the observed lower amounts of antisense piRNAs in D. buzzatii (supplementary file S3, Supplementary Material online). Qin is known to enforce heterotypic ping-pong between Aub and Ago3 by preventing futile homotypic Aub:Aub cycles, which mainly produce sense piRNAs (Zhang et al. 2011). A recent study has demonstrated that homotypic Aub:Aub ping-pong also generates lower Piwi-bound antisense-phased piRNAs, because qin ensures the correct loading of Piwi with antisense sequences (Wang et al. 2015). Therefore, a lower expression of qin (coupled with an excess of Aub) could lead to a less efficient production of antisense piRNAs (both secondary and phased) in D. buzzatii compared with D. koepferae. However, we must note that the remarkably higher expression levels of krimper in D. buzzatii (log2FC = 5.0, table 3) may diminish these effects, because krimper contributes to heterotypic ping-pong cycle formation by sequestering unloaded Ago3 proteins to prevent illegitimate access of other RNA sequences into them (Sato et al. 2015; Webster et al. 2015).
Drosophila buzzatii and D. koepferae seem to present a functional divergence of the piRNA pathway, which could likely be at the origin of TE misregulation in hybrids. However, contrary to what was observed in D. melanogaster–D. simulans artificial hybrids (Kelleher et al. 2012), our hybrids do not exhibit deficient piRNA production. Indeed, global piRNA amounts in hybrids are higher than in D. buzzatii and resemble those observed in D. koepferae (fig. 5B and supplementary file S3, Supplementary Material online), and hybrid secondary piRNA biogenesis presents intermediate levels between parental species (fig. 5A). Thus, incompatibilities in our hybrids may entail piRNA-mediated silencing effectors rather than proteins involved in piRNA biogenesis, even though both kinds of proteins are among those with the lowest identity percentages (fig. 6).
Misexpression of SoYb, Hen1, and Panoramix Can Influence Hybrid TE Expression
Two of the piRNA pathway genes, SoYb and Hen1, are underexpressed in hybrids (table 3). Hen1 is known to methylate piRNAs at their 3′-ends in both follicle and germ cells (Horwich et al. 2007; Saito et al. 2007), but the impact of its mutation on TE expression may depend on the TE family. For instance, overexpression of HeT-A retrotransposon was observed in Hen1 mutants due to a higher instability of piRNAs (Horwich et al. 2007), but other mutants exhibited an unchanged expression of retrotransposons (Saito et al. 2007). SoYb seems to be involved in primary piRNA biogenesis and has a partially redundant function with its paralog BoYb (Handler et al. 2011). Thus, even a complete gene loss of SoYb could be compensated by BoYb and would not lead to a widespread TE overexpression. Curiously, BoYb was underexpressed in D. simulans–D. melanogaster artificial hybrids (Kelleher et al. 2012). Although downregulation of Hen1 and SoYb cannot explain the whole pattern of TE deregulation, we cannot dismiss it as a possible contributor to TE overexpression in some cases.
On the other hand, overexpression of Panoramix, known to be essential for TE transcriptional silencing (Czech et al. 2013; Handler et al. 2013; Sienski et al. 2015; Yu et al. 2015), may compensate for silencing deficiencies (especially at a posttranscriptional level) and be at the origin of TE underexpression.
TE Deregulation May Involve Other Mechanisms
We have shown that TE deregulation in hybrid ovaries may be related to the piRNA pathway in terms of 1) incompatibilities due to its divergence between parental species, 2) misregulation of some genes involved in TE silencing, and 3) differences between parental piRNA pools (for a few TE families). However, changes in this pathway may not explain the whole set of alterations of TE expression observed in hybrids.
For instance, the endo-siRNA pathway is known to silence TEs in somatic and germinal tissues, with a partially redundant function with the piRNA pathway in gonads (Saito and Siomi 2010). Drosophila buzzatii–D. koepferae hybrids do not present lower global levels of TE-related endo-siRNAs than parental species (supplementary file S3, Supplementary Material online), and the few ≥2-fold changes in TE-specific endo-siRNA populations seem to be positively correlated to changes in TE expression. Therefore, there is no evidence pointing to a high impact of endo-siRNAs in hybrid TE deregulation, although we cannot discard a mild role in somatic TE silencing. Unfortunately, our data do not allow the distinction between somatic and germinal elements (and related bibliography in our species model is virtually nonexistent), but the presence of the usually somatic gypsy elements among deregulated families (tables 1 and 2) could indicate that some of them are indeed expressed in follicle somatic cells.
On the other hand, histone methylation marks linked with permissive or repressive chromatin states have frequently been associated with TE sequences and their surroundings (Klenov et al. 2007; Yasuhara and Wakimoto 2008; Riddle et al. 2011; Yin et al. 2011). This has been shown to be tightly connected with the piRNA pathway: For instance, expression of piRNA clusters depends (directly or indirectly) on methylation marks (Rangan et al. 2011; Goriaux et al. 2014; Mohn et al. 2014; Molla-Herman et al. 2015), and piRNA-mediated transcriptional silencing triggers the deposition of repressive H3K9me3 marks. Other mechanisms, such as endo-siRNAs and miRNAs, are also able to recruit this silencing machinery, leading to heterochromatin formation (Holoch and Moazed 2015). In D. buzzatii–D. koepferae hybrids, two of the upregulated miRNAs target two histone methyltransferase genes (Su(z)12 and Su(var)3-9). We could hypothesize that abnormal silencing of these genes might cause a failure in the deposition of histone modifications, resulting in abnormal TE expression.
Finally, two other TE defence mechanisms have been proposed to be activated in wild wheat hybrids: Deletion and methylation (Senerchia et al. 2015). Even though DNA methylation is not common in Drosophila, internal or complete deletions of TE copies have been suggested to act as a prevention mechanism against TE genome invasions (Petrov and Hartl 1998; Lerat et al. 2011; Romero-Soriano and García Guerreiro 2016). In that case, suppression of active insertions could reduce the RNA amounts of some TE families, contributing to their underexpression. Furthermore, recombination between copies is known to control R1 elements that are specifically inserted in 28S rRNA genes in Drosophila (Eickbush and Eickbush 2014).
The pattern of TE deregulation observed in D. buzzatii–D. koepferae hybrids seems to be the result of several interacting phenomena involving different regulation pathways, as has been observed in plants during stress episodes (Slotkin et al. 2009; Ito et al. 2011; Marí-Ordóñez et al. 2013; Creasey et al. 2014). For instance, when a de novo invasion of an active retrotransposon takes place (Marí-Ordóñez et al. 2013), the action of the DCL4/Ago1 small RNA pathway (21-nt siRNAs) necessarily precedes the achievement of efficient TE silencing by another small RNA pathway (DCL3/Ago4, 24-nt siRNAs).
TE Deregulation across Generations of Hybridization
Interspecific gene flow between D. buzzatii and D. koepferae is a natural source of genetic diversity that can only be maintained through introgression of a parental genome in F1 females, as F1 males are all sterile (Marin et al. 1993). Therefore, the study of backcrossed hybrids delves into the understanding of the real impact of hybridization in nature. We show that differences in ovarian TE expression between hybrids and parents are concordant with the expected D. buzzatii/D. koepferae genome fraction at each generation: F1 is equally distant from both parental species, whereas BC1 drifts apart from D. koepferae (fig. 3A). Furthermore, the total amount of deregulated TE families is lower in BC1 (10.6% of the expressed TEs) than in F1 (15.2%): A generation of backcrossing seems to be sufficient to restore the regulatory mechanisms of some families, but not of the totality. A similar result was reported in inbred lines of Oryza sativa introgressed with genetic material from the wild species Zizania latifolia, where copia and gypsy retrotransposons were activated and then rapidly repressed within a few selfed generations (Liu and Wendel 2000). F1 and BC1 ovaries exhibit the lowest number of differentially expressed TEs within one-to-one sample comparisons (supplementary table S2, Supplementary Material online) and present similar TE expression profiles (fig. 2B). This points to the hypothesis that more generations would be necessary to restore TE expression to the parental levels. Indeed, if TE activation in hybrids is caused by the failure of different epigenetic mechanisms (Michalak 2009), these are expected to be mitigated after several backcrosses thanks to the dominance of one of the parental genomes. In agreement with this hypothesis, we showed in a recent study that TE activation causes a genome expansion in D. buzzatii–D. koepferae hybrid females, but the C-value decreases after the first backcross (Romero-Soriano et al. 2016).
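The expected genome fraction at each generation follows from simple halving arithmetic. The sketch below assumes that F1 females are backcrossed repeatedly to the same parental species (presumably D. buzzatii, given that BC1 drifts away from D. koepferae) and ignores selection and sex-chromosome effects. The function name is hypothetical.

```python
def expected_recurrent_fraction(n_backcrosses: int) -> float:
    """Expected autosomal genome fraction from the recurrent parent.

    n_backcrosses = 0 corresponds to the F1, 1 to BC1, and so on.
    Each backcross halves the share of the non-recurrent parent.
    """
    if n_backcrosses < 0:
        raise ValueError("n_backcrosses must be non-negative")
    return 1.0 - 0.5 ** (n_backcrosses + 1)

for generation, n in [("F1", 0), ("BC1", 1), ("BC2", 2)]:
    print(generation, expected_recurrent_fraction(n))
```

Under these assumptions the F1 carries half of each parental genome, while BC1 is expected to carry three quarters of the recurrent parent's genome, which matches the direction of the shift described in the text.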
Tendency to TE Repression in Hybrid Testes Demonstrates That TE Regulation Is Sex-Biased
We show that TE expression presents different patterns between ovaries and testes, both at the quantitative and qualitative levels (fig. 2). Other studies have reported tissue-specific expression of transposons between male and female gonads. For instance, in D. simulans and D. melanogaster, transcripts of 412 are only found in testes (Borie et al. 2002), and I-like elements are more expressed in testes than in ovaries of D. mojavensis and Drosophila arizonae (Carnelossi et al. 2014), as are Osvaldo and Helena in D. buzzatii and D. koepferae (García Guerreiro 2015; Romero-Soriano and García Guerreiro 2016). All these studies show higher transcript abundances in male gonads, which is consistent with the bias we observe toward testes overexpression compared with ovaries (supplementary table S2, Supplementary Material online).
These findings point to a differential TE regulation between male and female gonads, which was previously suggested by studies in Drosophila testes demonstrating that male piRNA biogenesis is not always performed by the same mechanisms as in ovaries (Nagao et al. 2010; Siomi et al. 2010). Concordantly, we observe that testes have lower piRNA amounts and a less efficient ping-pong cycle than ovaries (fig. 7). It has indeed been shown that piRNAs in testes are involved not only in TE repression but also in gene silencing, particularly of Stellate and vasa (Nishida et al. 2007).
Our results on TE deregulation in hybrids fully support the idea of sex-specificity in TE silencing. Contrary to ovaries, hybrid testes exhibit a bias toward TE underexpression compared with D. buzzatii (supplementary table S2, Supplementary Material online). Accordingly, the retrotransposon Helena was shown to exhibit lower transcript abundances in F1 testes than in D. buzzatii and D. koepferae (Romero-Soriano and García Guerreiro 2016), as was the case for most TE families in a transcriptomic study in F1 sunflower hybrids (Renaut et al. 2014). Although two other studies in Drosophila hybrids, focused on individual TEs, displayed the opposite effect (Carnelossi et al. 2014; García Guerreiro 2015), we consider that disparity between specific studies fits in our global results.
TE underexpression prevalence in our hybrid testes can be explained by an increase of piRNA production and ping-pong signal in F1 testes (fig. 7B and C). Thus, activation of piRNA biogenesis, especially through the ping-pong cycle, seems to be responsible for TE repression in testes. Consistent with this tight repression of TE activity in males, the genome size increase observed in D. buzzatii–D. koepferae hybrids occurs only in females, whereas the hybridization impact on male genome size is undetectable (Romero-Soriano et al. 2016).
Considerations for analyses in non-humans
Many of the methods listed in Table 1 have been successfully applied to species other than human, and to transposable element varieties other than the non-LTR elements focused on in this review so far. For example, Retroseq has been applied to mouse genomes to detect LTR elements such as IAP and MusD in addition to the mouse varieties of LINE (L1Md) and SINE (B1/B2) elements. T-lex and T-lex2 have been applied to Drosophila genomes, detecting a wide variety of different TE families. While non-LTR TEs in human have a consensus insertion site preference that is widespread in the human genome, other TE families have more specific integration site preferences. For example, the Ty1 LTR retroelement strongly prefers integration near Pol III transcribed tRNA genes and seems to associate with nucleosomes, while Tf1 elements (also LTRs) prefer nucleosome-free regions near Pol II promoters. Hermes elements (a type of DNA transposon) also prefer nucleosome-free regions and have a characteristic TSD sequence motif (nTnnnnAn). Non-LTR retroelements can also have strong insertion site preferences, a prominent example being the R1 and R2 elements from Bombyx mori, which target 28S ribosomal genes and have been used to dissect the biochemical steps involved in non-LTR integration. These various propensities to insert proximal to genomic features and to have defined sequence characteristics at the insertion site could be used to filter insertion detections from WGS data for these TE families in non-human species, in combination with the general approaches already covered for non-LTR elements that have weaker insertion site preferences. Additionally, some of the characteristics of non-LTR retrotransposition presented so far may not apply to other TE classes and families and could lead to false negatives if putative insertions are inappropriately filtered against certain characteristics. For example, some DNA transposons (e.g. Spy) do not create target site duplications, so software that requires TSD will miss these. Other TEs have fixed TSD lengths, e.g. the Ac/Ds transposons in maize, famously initially described by McClintock in the 1950s, create an 8 bp TSD [60, 61], so a detector that allows Ac/Ds predictions with other TSD sizes might be more prone to false positives.
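A family-aware target site duplication (TSD) filter of the kind suggested above could look like the following minimal sketch. It is not taken from any of the tools named in the text; the function names, the parameter names, and the rule set are hypothetical. The motif syntax follows the text, where `n` stands for any base.

```python
def matches_motif(tsd, motif):
    """True if tsd fits the motif, where 'n' in the motif matches any base."""
    if len(tsd) != len(motif):
        return False
    return all(m in ("n", "N") or m.upper() == b.upper()
               for b, m in zip(tsd, motif))

def passes_tsd_filter(tsd, expected_length=None, motif=None,
                      tsd_required=True):
    """Apply family-specific TSD expectations to a candidate insertion.

    tsd: the detected TSD sequence ('' if none was found).
    expected_length: fixed TSD length for the family, if any.
    motif: TSD sequence motif for the family, if any.
    tsd_required: set to False for families that create no TSD.
    """
    if not tsd:
        return not tsd_required
    if expected_length is not None and len(tsd) != expected_length:
        return False
    if motif is not None and not matches_motif(tsd, motif):
        return False
    return True

# Hermes-like rule from the text: 8 bp TSD matching nTnnnnAn.
print(passes_tsd_filter("GTACGTAC", expected_length=8, motif="nTnnnnAn"))
# A family without TSDs should not be rejected for lacking one.
print(passes_tsd_filter("", tsd_required=False))
```

Keeping the expectations as per-family parameters, rather than hard-coding human non-LTR assumptions, is what avoids both the false negatives and the false positives described in the paragraph above.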
Capsid proteins are among the viral hallmark proteins, and their presence distinguishes viruses from other types of mobile genetic elements [19, 31, 32]. Here we show that Polintons and Tlr elements, currently classified as non-viral transposable elements, encode two key proteins required for virion formation, the DJR MCP and the penton protein, i.e. the major and minor capsid proteins. This finding, combined with previous observations that these elements also encode a typical viral genome-packaging ATPase and adenovirus-like protease (absent in Tlr), makes a strong case that Polintons and Tlr elements comprise a group of genuine viruses that we propose to denote ‘Polintoviruses’. Polintoviruses might have played key roles in the evolution of DNA viruses of eukaryotes, in particular adenoviruses, virophages, and possibly the NCLDV. Identification of actively reproducing Polintoviruses is an important experimental challenge.
Transposable Element Misregulation Is Linked to the Divergence between Parental piRNA Pathways in Drosophila Hybrids
Interspecific hybridization is a genomic stress condition that leads to the activation of transposable elements (TEs) in both animals and plants. In hybrids between Drosophila buzzatii and Drosophila koepferae, mobilization of at least 28 TEs has been described. However, the molecular mechanisms underlying this TE release remain poorly understood. To gain insight into the causes of this TE activation, we performed a TE transcriptomic analysis in ovaries (known for playing a major role in TE silencing) of parental species and their F1 and backcrossed (BC) hybrids. We find that 15.2% and 10.6% of the expressed TEs are deregulated in F1 and BC1 ovaries, respectively, with a bias toward overexpression in both cases. Although differences between parental piRNA (Piwi-interacting RNA) populations only partially explain these results, we demonstrate that piRNA pathway proteins have divergent sequences and are differentially expressed between parental species. Thus, a functional divergence of the piRNA pathway between parental species, together with some differences between their piRNA pools, might be at the origin of hybrid instabilities and ultimately cause TE misregulation in ovaries. These analyses were complemented with the study of F1 testes, where TEs tend to be less expressed than in D. buzzatii. This can be explained by an increase in piRNA production, which probably acts as a defence mechanism against TE instability in the male germline. Hence, we describe a differential impact of interspecific hybridization in testes and ovaries, which reveals that TE expression and regulation are sex-biased.
Keywords: Drosophila buzzatii; Drosophila koepferae; RNA-seq; interspecific hybridization; piRNAs; transposable elements.
© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
Crosses diagram. (A) is the first interspecific cross between D. koepferae…
Transposable element expression summary. Dbu, D. buzzatii; Dko, D. koepferae; ♂♂,…
TE differential expression analyses in ovaries. (A) Differentially expressed TE families…
Parental piRNA populations and TE deregulation in ovaries. (A) Expression of…
Characterization of piRNA populations in parental and hybrid ovaries. Dbu, D. buzzatii…
Distribution of identity percentages between D. buzzatii and D. koepferae in silico proteomes.…
Differential expression analyses in testes. Dbu, D. buzzatii; ♂♂, testes; ♀♀, ovaries. (…