Genome Assembly Using Reads

Genome Assembly Using Reads

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I'm taking an online bioinformatics class and I am stuck on a problem. The resources I have found don't help that much. Here is the problem:

"Assemble the error-free reads from a circular genome on the dataset below. You do not need to consider the reverse complement of the reads.


As the reads do not have the same length, convert the reads to 4-mers using the approach proposed by Idury and Waterman."

I believe this means I need to convert this into sets of 4 characters each. Do I simply break them up by taking the first 4 characters and last 4 characters like this?


And end up with a new dataset of: "CGTT TTCT CTAG GACG ACGT TAGA AGAC"?

Or is there more to it than that? Thanks for any help.

Actually the new data set will be,

for k = 4 (k-mer value)




A Primer of Molecular Biology

5.10.2 Gene Assembly

Genome assembly refers to the process of putting nucleotide sequence into the correct order. Assembly is required, because sequence read lengths – at least for now – are much shorter than most genomes or even most genes. Genome assembly is made easier by the existence of public databases, freely available on the National Center for Biotechnology Information website ( ). Just as it is much easier to assemble a picture puzzle if you know what the picture looks like, it is much easier to assemble genes and genomes if you have a good idea of the sequence order. In the human genome, genes occur in the same physical location on the chromosome, but there can be different numbers of copies and variable numbers of repeated sequence that complicate assembly. Although bacterial genomes are much smaller, genes are not necessarily in the same location and multiple copies of the same gene may appear in different locations on the genome. Therefore even with the availability of commercial software and ever growing reference databases, the process of genome assembly can take considerably longer than the time to obtain actual sequence.

You can go to the official website of QUAST and click on the DOWNLOAD button.

QUAST evaluates genome assemblies. For metagenome assembly evaluation, see For contig alignment visualization, see…

You will be directed to a SOURCEFORGE download page from where you can download the latest version (quast-5.0.2 when I was writing this article) of QUAST. The pre-compiled binaries will be downloaded and you can run it straight away after extracting.

You can see the following after executing or python .

Once you have ensured that QUAST is running correctly, we can start to assess some assemblies.


Genome sequencing, assembly, and annotation

We assembled the genomes of G. thurberi and G. davidsonii using data from both the nanopore long-read and the Hi-C short-read technologies. We produced 114.3 Gb and 108.3 Gb clean reads, respectively, for G. thurberi (

146×) and G. davidsonii (

135×) using the Nanopore platform (Additional file 1: Table S1-S2). After correction using the Illumina short reads, we generated a G. thurberi genome of 779.6 Mb with a contig N50 of 24.7 Mb the corresponding values for G. davidsonii were 801.2 Mb and 26.8 Mb (Table 1 and Additional file 1: Table S3) the sequence continuities are significantly improved for both species as compared with other recently reported genome assemblies [15, 16].

Using 284 million and 280 million valid Hi-C interaction pairs for the G. thurberi and G. davidsonii genomes, respectively (Additional file 1: Table S4), we anchored and oriented 777.2 and 799.2 Mb of the assembly onto 13 pseudochromosomes of G. thurberi and G. davidsonii respectively (Additional file 2: Fig. S1-S2), which represented more than 99.7% of the total assembly, indicating that our new assemblies reached a reference grade for quality. As an indication of the improved contiguity, the contig length for our G. thurberi genome represents a 940-fold increase compared to previously published G. thurberi sequences (24.7 Mb versus 0.026 Mb) [7], and our G. thurberi genome has a 3750-fold reduction in fragmentation (74 versus 277,903). Similarly, there was an 836-fold increase for G. davidsonii genome contig length (26.8 Mb versus 0.032 Mb) and 5150-fold reduction in fragmentation (104 versus 535,698). Moreover, the total assembly length and gene annotation number were all higher for our G. thurberi and G. davidsonii genome assemblies as compared to the recently reported G. thurberi and G. davidsonii genome resources [7]. Approximately 58.0% and 58.6% of the assembly sequences were annotated as repetitive sequences in the G. thurberi and G. davidsonii assemblies, respectively (Additional file 1: Table S5).

We next evaluated the assembly completeness by aligning the 192 and 212 million paired-end Illumina short reads against the G. thurberi and G. davidsonii genome assemblies and BUSCO [17] analysis, both methods showed that both assemblies are of high quality (Additional file 1: Table S6-S7 and Additional file 2: Fig. S3).

Genomic diversity among six D genomes

Our generation of high-quality genome assemblies for G. thurberi and G. davidsonii provides an opportunity to compare different D genome species that shared a common ancestor, potentially helping identify the post-divergence genomic rearrangements in cotton. The overall collinearities between the two newly assembly genomes are largely conserved, as supported by more than 78% of G. thurberi genome matching in one-to-one syntenic blocks with 80.6 % of the G. davidsonii genome. Similarly, we found approximately 78% of the G. thurberi genome matched in one-to-one syntenic blocks with

81% of the G. raimondii genome (Additional file 1: Table S8). And

77% G. davidsonii genome matched in one-to-one syntenic blocks with

83% of the G. raimondii genome. Our previous study showed that

86% of the G. raimondii genome matched in one-to-one syntenic blocks with the D subgenome of Gossypium hirsutum (Gh_Dt1), confirming that G. raimondii is a plausible donor species of allotetraploid cotton species [4].

We found inversions are the major rearrangement type among the different D genomes. The inversions between the two new assemblies span approximately 59.6 Mb in G. thurberi, a level similar to a previously reported comparison between G. raimondii and the TM-1 D subgenome [4]. Of particular note, we detected a large inversion on Chr11 between the G. thurberi and G. davidsonii occupying 20.4 Mb note, this was confirmed by mapping the Hi-C data for one accession against to the genome of the other, and vice versa (Fig. 1a-d and Additional file 2: Fig. S4-S5). Enlargements from the heatmaps revealed discontinuous signals for these inversions (in the region marked by the color triangle in Fig. 1b).

Characterization of genomic variation among different D genomes. a Genome comparison of among G. barbadense (D subgenome, Gb_Dt2), G. hirsutum (D subgenome, Gh_Dt1), G. raimondii (D5), G. davidsonii (D3), G. thurberi (D1), and G. turneri (D10). The inversions are marked in orange and magenta. b Identification of a large inversion on Chr11 between G. thurberi and G. davidsonii. The panel shows chromatin interaction heat maps including G. thurberi Hi-C data mapping G. thurberi (D1_map_D1) and G. davidsonii Hi-C data mapping G. thurberi (D1_map_ D3). The triangle marks the inversions in the heat maps. c Genomic comparison between G. thurberi and G. davidsonii on Chr11. d The panel shows chromatin interaction heat maps including G. davidsonii Hi-C data mapping G. davidsonii (D3_map_ D3) and G. thurberi Hi-C data mapping G. davidsonii (D1_map_D3). The triangle marks the inversions in the heat maps. e A/B compartments in Chr11 orange represents the A compartments and blue represents the B compartments. The transparent boxes indicate A-B compartment switching regions. f TAD heatmap around the right breakpoint of the large inversion on Chr11

Our finding that G. davidsonii, G. turneri, and G. raimondii share a conserved syntenic relationship for the large Chr11 inverted region supports that this Chr11 inversion is specific to G. thurberi. Further, we detected that G. thurberi Chr11 exhibits extensive B-to-A compartment switching specifically in a region neighbor the right breakpoint of the large inversion (Fig. 1e). And we also found that, relative to G. davidsonii, the topologically associating domains (TAD) were obviously extensively reorganized near the breakpoints of G. thurberi Chr11 breakpoints (Fig. 1f). We analyzed the conserved and switched A-B compartments in the inverted regions and at the whole-genome level. We found 39 conserved and 26 switched A-B compartments in the inverted regions, and the corresponding values for the whole genomes were 1045 and 532. Chi-square tests suggested that there was no bias towards A-B compartment switching in inverted regions (chi-square test, P = 0.2664). In contrast, we detected and obviously elevated proportion of reorganized TAD boundaries near the breakpoints (70 out of 190) when compared to the whole genome (1143 out 6184) (chi-square test, P < 0.0001) (Additional file 2: Fig. S6). These results offer empirical demonstrations showing that inversions in plant genomes can—in addition to their better understood impacts on one-dimensional linear genome sequences divergence—also drive divergence in TAD boundary formation.

In addition to Chr11, we also found some inversions from Chr01, Chr05, Chr06, and Chr12 that are specific to G. thurberi because they are shared by G. davidsonii and G. turneri (Fig. 1a). Similarly, some inversions from G. davidsonii include inversions from Chr02, Chr05, Chr06, Chr10, and Chr13, which are shared by G. raimondii and G. thurberi. Furthermore, we observed that in G. turneri (D10), G. hirsutum, and Gossypium barbadense (Gb_Dt2), most of inversions are species specific (Fig. 1a), indicating that such inversions have formed during species divergence such structural rearrangements could have directly contributed genetic novelty that contributed to such divergence.

Genomic landscapes of G. thurberi and G. davidsonii

As with most genomes, the G. thurberi and G. davidsonii sequences positioned near the telomere are enriched of coding genes while having a lower-than-average level of repeat sequences (Fig. 2a). Again as expected, the pericentromeric regions are enriched for repeat sequences but show a deficit for coding genes compared to the genome-wide average (Fig. 2a).

Gene family expansion among 11 cotton species. a Genomic landscape between G. thurberi and G. davidsonii genomes. (i) Genomes of G. thurberi (right panel) and G. davidsonii (left panel). (ii,iii) Transposable elements and gene density. (iv) 5mC DNA methylation levels. (v) 6mA DNA methylation levels. (vi) A and B compartments across the chromosome, orange indicates A compartments and blue indicates B compartments. (vii) Expression level based on RNA-seq analysis of leaves. The expression level was normalized by the number of reads per bin/(number of mapped read (in millions))×bin length (kb). (viii) InDel density between G. thurberi and G. davidsonii. (ix) SNP density between G. thurberi and G. davidsonii. (x) PAV density between G. thurberi and G. davidsonii. (xi) Syntenic block between G. thurberi and G. davidsonii. All data in panel (i)-(x) are shown in 500-kb windows. b A phylogenetic tree based on 7561 single-copy genes. The ratios of gene expansion and contraction of each branch are showed in the pie diagrams. The digits present the number of gene families which have experienced expansion or contractions. c,d KEGG pathway enrichment of the gene families which have experienced expansion or contraction in G. thurberi and G. davidsonii

Our RNA-seq (coding and non-coding) expression profiling of G. thurberi and G. davidsonii (young leaves) showed that sequences in pericentromeric regions are expressed at generally lower levels compared to sequences in chromosome arms (Fig. 2a). We next examined small variations (InDels and SNPs) between G. thurberi and G. davidsonii. The InDel densities showed a decreasing pattern, and the SNP densities showed an increasing tendency moving from the telomere region to the centromere region (Fig. 2a and Additional file 2: Fig. S7).

We next detected presence/absence variations (PAVs) and identified a total of 14,401 of G. thurberi-specific genomic PAVs and 15,684 of G. davidsonii-specific genomic PAVs, occupying 39.5 Mb and 52.0 Mb in G. thurberi and G. davidsonii genomes. The PAVs are evenly distributed across the chromosomes, with most of the PAVs being shorter than 10 kb (Fig. 2a and Additional file 2: Fig. S8a).

A total of 490 and 570 PAV-localized genes were identified as G. thurberi- or G. davidsonii-specific genes. Approximately 39.6% and 37.4% of the PAV genes from G. thurberi and G. davidsonii had apparent orthologs from at least one of the three other Gossypium species, underscoring that a relatively small proportion of these PAV genes were present in the ancestral genome (Additional file 2: Fig. S8b). PAV genes without obvious orthologs in the examined Gossypium species are likely to have arisen during the divergence and may represent sources of impactful genes that have contributed to the speciation and presently adapted characteristics of G. thurberi and G. davidsonii.

Evolution within and between eleven cotton genomes

We also compared new coding genes from the new assemblies with Gossypium arboreum (A2), Gossypium australe (G2), G. raimondii (D5), G. turneri (D10), and the D subgenomes of the five allotetraploid cotton species (G. hirsutum, G. barbadense, Gossypium tomentosum, Gossypium mustelinum, and Gossypium darwinii). Our phylogenetic tree supports a monophyletic origin for the allotetraploid species that was likely derived from a hybridization between G. raimondii and an A genome species (Fig. 2b). A total of 35,454 orthologous groups were identified through orthoMCL, and as expected, the G. australe (G2) and G. arboreum (A2) has more unique genes than those of D genome species, because the genomic divergences are more significant in diverse chromosomal groups than within a single group (Additional file 2: Fig. S9a). GO analysis revealed enrichment for “DNA recombination,” “DNA integration,” and “DNA metabolic process” among the unique gene sets for G. thurberi and G. davidsonii (Additional file 2: Fig. S9b-c).

Using G. arboreum and four D genome species (G. thurberi, G. davidsonii, G. raimondii, and G. turneri), we evaluated the divergence times between the diploid A genome and four D genome species and found they apparently diverged between 5.07 and 5.13 MYA, and the four D genomes diverged between 1.51 and 2.04 MYA (Additional file 2: Fig. S10). Within the D genome clade, the greatest extents of divergence were detected between G. turneri and the other 3 species, then the followed divergence was between G. raimondii and the other 3 species, and the most recent divergence was between G. thurberi and G. davidsonii (Fig. 2a).

We next used CAFE (Computational Analysis of gene Family Evolution) to estimate gene family expansions and contractions among the 23,825 ortholog groups, which revealed that 8 out of the 11 tested species have experienced more gene family expansions than gene family contractions (P < 0.05) (Fig. 2b). This is informative when considered against gene family dynamics known for the D subgenomes of allotetraploid cottons: our finding that a relatively higher proportion of species-specific gene families have experienced expansion or contractions in diploid D genome species compared with gene family dynamics in D subgenomes support that this form of genome divergence is less active in the D subgenomes than in the D genome species (Additional file 2: Fig. S11).

We detected that G. thurberi has experienced expansion for genes related to steroid biosynthesis and brassinosteroid biosynthesis, as well as for genes encoding pectinesterase enzymes (Additional file 2: Fig. S12). Given the reported roles of these biochemical pathways and enzymes in diverse stress tolerance responses, perhaps such expansion has contributed to the previously reported Verticillium dahliae resistance of G. thurberi [18]. Enriched genes specific to G. davidsonii included genes which function in photosynthesis and oxidative phosphorylation pathways, in the photosystem I reaction center (PsaB), and in the photosystem II reaction center (psbD and psbE) are enriched in G. davidsonii (Additional file 2: Fig. S13), results clearly suggesting that the potential for differential photosynthetic capacities in G. davidsonii.

Epigenetic modifications variations in 3D structure

Both PacBio and Nanopore can distinguish modified bases from standard nucleotide bases in plants [19, 20]. However, the accuracy of SMRT sequencing for detecting DNA methylation is known to be heavily affected by the sequence coverage [21, 22]. We used the nanopore data to analyze the global landscape of epigenetic modifications on chromosomes. The global N6-methyldeoxyadenine (6mA) level is approximate 1.1% of all adenines for G. thurberi and 1.3% for G. davidsonii, these proportions are much higher than previous reports about G. hirsutum and G. barbadense that were based on PacBio sequencing data [19]. For both G. thurberi and G. davidsonii, the 6mA distribution is uneven across the chromosomes, for example exhibiting enrichment at both the middle regions of chromosome arms and in pericentromeric regions (Fig. 2a), findings supporting the proposal from a rice study that the genomic distribution of 6mA is not random [23]. A comparison of the G. davidsonii genome methylation frequencies generated using Nanopore technology with the methylation frequencies obtained through whole-genome bisulfite sequencing technology showed an excellent correlation between the two methods (R = 0.88). Among the three types of methylation (CHG, CG, and CHH), CHG showed the highest correlation (0.95), followed by CG (0.92) and CHH (R = 0.77) (Additional file 2: Fig. S14).

The chromosome can be experimentally delineated into open (A) or closed (B) compartments, and these A/B compartments can be further divided into smaller TADs. We found that A compartments tended to cluster at chromosome arms, while B compartments tended cluster near pericentromeric regions (Fig. 2a). Approximately 41.5% of the G. thurberi genome belongs to A compartments this was 42.3% for G. davidsonii and 42.0% for G. raimondii genome. Note that these A/B compartment ratios are similar with ratios previously reported for allotetraploid D subgenomes [24].

We further evaluated the epigenetic features in the A/B compartments for G. thurberi and G. davidsonii in an analysis using 100-kb windows. For both the G. thurberi and G. davidsonii genomes, the gene densities were much higher in the A compartments than the B compartments (Fig. 3c). Further, it was intriguing to observe that the levels of both 5mC (CG, CHH, and CHG) and 6mA were significantly lower in A compartments than in B compartments (Fig. 3a). Similarly, the TE content was much lower in A compartments as compared to B compartments (Fig. 3b).

Methylation features of 3D chromatin. a–c Methylation level, TE ratio, and gene density in A and B compartments in G. thurberi and G. davidsonii. A two-sided Wilcoxon signed-rank indicates there were significant differences at **P < 0.001. d Methylation feature around TAD boundaries. The methylation levels in TAD boundaries (orange lines) flanking 100 kb was compared with those methylation levels in random genome regions (blue lines). The lines on the right side (0 to 100 kb) indicate TAD regions, and the lines on the left side (− 100 to 0 kb) indicate TAD regions when TADs were organized consecutively or non-TAD regions when one TAD was not closely adjacent to the others. e Gene distribution around the TAD boundaries. The method for extracting genomic regions around boundaries was the same as that in panel d. f A-B compartment switching between G. thurberi (D1) and G. davidsonii (D3) or between G. raimondii (D5) and G. davidsonii (D3). g Comparison of TAD boundaries between G. thurberi and G. davidsonii (D1_Vs_D3) or G. raimondii and G. davidsonii (D5_Vs_D3)

We also analyzed epigenetic modifications around TAD boundaries and found that chromatin surrounding the TAD boundaries in both of the examined cotton species had relatively lower levels of 5mC (CG, CHG, and CHH) and 6mA compared against randomly sampled genomic regions (Fig. 3d). Notably, there is enrichment for ORF sequences at TAD boundaries, suggesting that epigenetic modifications may apparently contribute to the differential activation of genes positioned at TAD boundaries.

To check for higher-order structural variations possibly related to the divergence of D genome species, we compared 3D structures among G. thurberi, G. davidsonii, and G. raimondii. Specifically, we constructed chromatin interaction maps for G. thurberi, G. davidsonii, and G. raimondii at 50 kb resolution, and as expected, the frequency of intra-chromosomal interactions displayed a rapid decrease with extended linear distance (Additional file 2: Fig. S15-S17). This analysis revealed strong rewiring of chromatin interactions in the inverted regions, consistent with a model of distinct territories formed by individual chromosome arms (Additional file 2: Fig. S18). For instance, G. thurberi carrying a pericentromeric inversion on Chr11 showed preferential interactions between these loci when present on the same chromosome arm (Additional file 2: Fig. S18).

We compare the organization of A/B compartments between the G. thurberi and G. davidsonii or between G. davidsonii and G. raimondii. A total of 57.8 Mb and 44.7 Mb in the G. thurberi and G. raimondii genomes represented apparent A-to-B compartment switching as compared with the compartment status data for G. davidsonii (Fig. 3f). Similarly, a total of 28.9 Mb and 28.1 Mb of genome regions apparently represent B-to-A compartment switching between the G. thurberi and G. raimondii genomes (Fig. 3f), findings highlighting that B-to-A switching and A-to-B switching are uneven among the diploid D genomes. We also checked the potential for differential expression of genes located in the A/B switching regions: among 3189 A/B switching genes between G. thurberi and G. davidsonii, 556 were DEGs. Among 3670 A/B switching genes between G. raimondii and G. davidsonii, 613 were DEGs. These findings support the previous idea that only a small subset of genes are transcriptionally affected by compartment changes [25].

We next compared the TAD boundaries and found that more than 90% of the G. thurberi and G. raimondii TAD boundaries were conserved in G. davidsonii (Fig. 3g), indicating that the TAD boundaries have been relatively strongly conserved among sister species after divergence.

Long-range interactions in G. thurberi and in G. davidsonii

Long-range chromatin interactions functionally contribute to gene transcriptional regulation, but very little is known about 3D chromatin interactions in cotton. Seeking to characterize the pattern of long-range chromatin interactions, we conducted a genome-scale analysis and annotated the Hi-C peaks positioned within 2-kb upstream or 1-kb downstream of gene TSSs as “proximal Hi-C peaks” (P) all of the others were annotated as “distal Hi-C peaks” (D). We identified 22,328 P and 8304 D involved in long-range chromatin interactions in G. thurberi G. davidsonii had 22,816 P and 8808 D involved in long-range chromatin interactions (Fig. 4a).

Long-range interactions between the proximal and distal regulatory regions. a An example of long-range interactions on Chr08 in G. thurberi and G. davidsonii. b Distribution of long-range interactions in each chromosome. c The long-range interactions were divided into the P-P, P-D, and D-D interactions. d Comparison of all interactions between G. thurberi and G. davidsonii. e Comparison of P-D interactions between G. thurberi and G. davidsonii. f Violin plots for long-range interactions in G. thurberi and G. davidsonii. The center red line in plot indicates the median, and the black lines indicate the upper and lower quartiles of insertion time. g Summary of the number of P–D interaction with variable distances in G. thurberi and G. davidsonii. h Comparison of expression level for genes interacting or not interacting with chromatin interactions (**P < 0.0001, a two-sided Wilcoxon signed-rank test). i Transcriptional status for genes with or without chromatin interactions. “Inactive” represents gene with FPKM < 0.1 “Active” represents gene with FPKM ≥ 0.1. “w” indicates genes with chromatin interactions “w/o” indicates genes without chromatin interactions. j An example of one D interacted with two P in G. davidsonii. In the upper panel, the orange and blue lines represent Hi-C links in G. thurberi and G. davidsonii. respectively, and the blue boxes represent the genes locating in the interaction loop. The middle panel indicates the gene (Gd07G24850) around the P1 and P2. The lower panel is the read coverage generated by mRNA-seq

We also classified all of these interactions into three groups: proximal–proximal (P–P), proximal–distal (P–D), and distal–distal (D–D). A total of 47,604 and 51,367 intra-chromosomal interactions were identified for G. thurberi and G. davidsonii, respectively. Approximately 60% of these interactions were P-P interactions, followed by P-D (

10%) (Fig. 4c). Comparison of the average number of formed loops is informative: we found that one D can form an average of 1.56 or 1.62 loops with P for G. thurberi and G. davidsonii, respectively in contrast, one P can form an average of 1.34 or 1.31 loops with P. So, on average, D undergo more interactions than P, and it appears that genes regulated by D prefer to cluster together in the genome.

The number of interactions in each chromosome ranged from 2759 to 4417 in G. thurberi and 2999 to 4763 in G. davidsonii (Fig. 4b). There were 44,675 intra-chromosomal interactions were identified both in G. thurberi and G. davidsonii, while 2936 and 6693 interactions were specific to G. thurberi and G. davidsonii. We found 27,531 P-P interactions, 18,752 P-D interactions, and 5465 D-D interactions were conserved between G. thurberi and G. davidsonii, whereas 1043 and 2597 P-P interactions, 1578 and 3782 P-D, and 761 and 1067 D-D interactions were respectively specific for G. thurberi and G. davidsonii (Fig. 4d). This result emphasizes that only a small subset of intra-chromosomal interactions is divergent between cotton sister species.

Strikingly, more than 73% of the promoters of these cotton genomes have 3 or more P-P interactions, with most promoters having about 1 P-D (Fig. 4e). We found that the median length of intra-chromosomal interactions were 90 kb and 100 kb for G. thurberi and G. davidsonii, respectively (Fig. 4f). Previous studies in human have showed that enhancers prefer to regulate nearby genes [26]. Most of the P-D interactions were within 100 kb for both species, and fewer than 6% of these interactions were larger than 300 kb (Fig. 4g).

We next generated the transcriptome datasets using mRNA-Seq of G. thurberi and G. davidsonii leaves to help experimentally unravel the relationships between chromatin interactions and the transcriptional activation of cotton genes. We compared the expression profiles of genes with or without chromosomal interactions and found that genes with chromatin interactions had relatively higher expression levels than those without interactions (P < 0.0001, Wilcoxon rank-sum test) (Fig. 4h). Although the chromatin interactions were captured by Hi-C, we found that

40% of the genes with interactions not expressed or expressed very lowly (FPKM < 0.1) (Fig. 4i). However, between 43 and 46% of genes which had no chromatin interactions in either G. thurberi and G. davidsonii were expressed in leaves, a level slightly higher than from a report for such genes in an analysis of shoots and immature ears in maize [26] (Fig. 4i).

We then examined the intersection of the differentially expressed genes between G. thurberi and G. davidsonii and differential P-D interaction genes to explore the possible roles of enhancers on the gene expression. Among the genes with differential P-D interactions between G. thurberi and G. davidsonii, there 509 genes which significantly altered expression levels, and these genes exhibited enrichment for GO terms including “response to biotic stimulus” and “defense response” (Additional file 2: Fig. S19). An example of these genes is the homeobox gene Gd07G248500, which encodes an ortholog of AtHB16, which is known to regulate leaf development and photoperiod sensitivity in Arabidopsis thaliana [27]. We found that the promoter of Gd07G248500 interacted with 2 D peaks in both G. thurberi and G. davidsonii, but the interaction intensities were stronger in G. davidsonii than those in G. thurberi (15.13_vs_0.06 and 12.94_vs_1.2), which may promote its expression in G. davidsonii leaves (Fig. 4j), a situation like the maize PSB1 that had a shoot-specific P-D interaction with a higher expression in shoot than that in immature ear [26].

Gene-order and structural variation between G. thurberi and G. davidsonii

To analyze gene order, a total of 32,981 orthologous pairs were identified between G. thurberi and G. davidsonii, among which 1104 orthologous gene pairs were located in the inversion regions (Table 2) these account for

3.3% of the total analyzed orthologous gene pairs. The fact that this represents a higher proportion than that between the G. hirsutum cultivars TM-1 and ZM24 supports that more genes are affected by interspecies inversions compared to intraspecies inversions.

67%) orthologous gene pairs with only missense mutations in their CDS or non-frameshift InDels between G. thurberi and G. davidsonii. However, only

9% of the orthologous gene pairs had no amino acid changes between G. thurberi and G. davidsonii, and approximately 2% and 1% of these pairs had no variation in coding sequence (CDS) or gene bodies (CDS and intron regions), respectively (Table 2). These proportions are significantly lower than those from the comparison of G. hirsutum cultivars TM-1 and ZM24 (71%, 69%, and 56%), indicating that interspecies orthologs are more divergent than intraspecies homologs. Note that more than 9% of the syntenic orthologous genes pairs carried large-effect mutations, including 3n ± 1 InDel, start-codon mutation, stop-codon mutation, splice-acceptor mutation, and splice-donor mutation in the CDS regions (Table 2). More than 11% of the syntenic orthologous gene pairs were impacted by large structural variations, with 85% of these having lost at least one exon any biological significance of these variations will require further study.

We also characterized extent of gene amplification in the G. thurberi and G. davidsonii genomes. More than 3400 tandem duplicated genes were identified in both G. thurberi and G. davidsonii, among which the stress-related pathways phenylalanine metabolism, glutathione metabolism, plant-pathogen interaction, and phenylpropanoid biosynthesis were found to be enriched in a KEGG analysis, indicating that tandem duplication has apparently enhanced the tolerance of G. thurberi and G. davidsonii to various stresses (Additional file 2: Fig. S20). In total, 3136 and 3154 genes were identified as singleton genes in G. thurberi and G. davidsonii, respectively (Additional file 1: Table S9). It was notable that there was a much higher proportion of transcription factors in the whole-genome duplication and segmental duplication sets than those from singleton genes (Fisher’s exact test, P < 2.2e−16), supporting our previous finding [4] that transcription factors have a tendency to be retained after duplication (Additional file 1: Table S10).

Identification of centromeres using a Hi-C heatmap method

Centromeres are mainly composed of repetitive retrotransposons and satellite repeats, and the challenge of accurately assembling centromeres using short-read sequencing data is well-documented [28] accordingly, centromere evolution is poorly understood. Previous studies of Hi-C matrices have shown that centromeres form a unique type of interacting subcompartment which can function as a barrier and prevent intra-chromosomal arm interactions [29]. By exploiting the insulation feature of centromeres in Hi-C heatmap data, we successfully developed a new method for centromere characterization based on Hi-C data.

In this method, we first map the Hi-C contact data against its corresponding reference genome to obtain valid read pairs (Fig. 5a). Next, we use the valid read pairs to generate a Hi-C heatmap (at 50 kb resolution), and then use this to search regions which apparently form barriers to intra-chromosomal arm interactions. Testing confirmed that these regions, which have less frequent contacts between chromosome arms on either side compared with their frequency of intra-arm contact, are indeed centromeres (Fig. 5b). Thirdly, based on the phylogenetic relationship, we used the known cotton centromeric LTRs to align against the reference genomes to validate these Hi-C centromeres (Fig. 5c). Finally, the centromere sequence features—including sequence composition, LTR insertion time, LTRs insertion pattern, and centromeric enriched LTRs—can be cataloged systematically to support studies of centromere evolution (Fig. 5d,e). Using this new method (Additional file 2: Fig. S21), we successfully identified the centromeres in the model plant Arabidopsis thaliana, Oryza sativa, and the new G. thurberi and G. davidsonii assemblies (Additional file 2: Fig. S22-S25).

An overview of centromere identification based on Hi-C data. a A diagram of Hi-C data mapping against the reference genome. b Characterization of centromeres in Hi-C heat maps. The left panel shows chromatin interactions, including G. davidsonii mapped to G. thurberi (D3_map_D1) and G. thurberi mapped to G. thurberi (D1_map_D1). The middle panel presents a genomic alignment around the centromeres. The three-dimensional rings indicate the centromeres. The right panel shows chromatin interactions, including G. davidsonii mapped to G. davidsonii (D3_map_D3) and G. thurberi mapped to G. davidsonii (D1_map_D3). The regions within the orange lines are the centromere regions. c Validation the centromeres by centromeric LTR (Centromere Retroelement Gossypium, CRG) BLAST analysis. The data showed the validation on Chr08. d Centromere feature analysis. The right panel presents a comparison of the repetitive elements for centromeres vs. the whole genome. The middle shows LTR insertion time distributions for centromeres specifically, and for the whole genome. The center red line in the plot indicates the median, and the black lines indicate the upper and lower quartiles for insertion times. The right panel shows an analysis of the intact LTR insertion pattern. An example is presented for G. thurberi Chr04. The digits present the insertion time of nearby LTRs. e Analysis of centromere LTR enrichment. The left panel represents the sequence identity characteristic of a “CentLTR” sequence, as examined in centromeres and non- non-centromeric regions in four D genomes. The right panel is the identity distribution pattern of CenLTR hits presented as a dot plot. This analysis detected a total of 152,285 CenLTRs in D1 centromeres, with 163,217 in D1 non-centromeric regions 158,815 in D3 centromeres, with 139,231 in D3 non-centromeric regions 16,093 in D5 centromeres, with 76,875 in D5 non-centromeric regions and 80,537 in Gh_Dt1 centromeres, with 246,791 in Gh_Dt1 non-centromeric regions

As we used nanopore long reads for our new genome assemblies, the centromeres are well assembled with the excellent coverage (Additional file 2: Fig. S26), thereby providing an unprecedented opportunity to study cotton centromere evolution. As we aligned G. thurberi against the G. davidsonii genome, we clearly found that there were no collinearities in the middle region of each orthologous chromosome (Additional file 2: Fig. S27). Chromosomal collinearity analysis showed that many non-syntenic regions were located in the centromeric regions (Additional file 1: Table S11), indicating that the centromeric regions have higher divergence compared to their neighboring (flanking) regions.

To further support this, we aligned the previously reported G. raimondii and G. hirsutum CENH3 ChIP-Seq data against the four genomes available for D genome species (G. thurberi, G. davidsonii, and Gh_Dt1) (Additional file 2: Fig. S28). We detected a strong peak in a narrow region on G. raimondii Chr08 when mapping G. raimondii ChIP-Seq data (Additional file 2: Fig. S28a). However, upon mapping G. raimondii ChIP-Seq data against the other three examined Gossypium genomes (G. thurberi, G. davidsonii, and G. raimondii), the signals were dispersed over a broader region, with no obvious major peaks. Mapping G. hirsutum CENH3 ChIP-Seq data against the G. hirsutum genome revealed an apparent peak on D12 no major peaks were detected when we mapped this data to the four other D genomes (Additional file 2: Fig. S28b). These findings underscore that centromeric regions can be highly divergent among closely related species.

We also mapped the G. thurberi Hi-C data against the G. davidsonii assembly and vice versa, we observed large gaps in the centromeric regions this indicates that centromeric sequences from the orthologous chromosomes in G. thurberi and G. davidsonii were highly divergent (Fig. 5b and Additional file 2: Fig. S3-4). Although the centromeric regions are highly divergent (without any syntenic blocks), we found that the flanking regions of the centromeres are highly conserved with good collinearities. For Chr03, Chr04, Chr07, and Chr08, no large-scale inversions were detected between orthologous chromosomes, highlighting that chromosomes arms are highly syntenic and lack obvious changes in their centromeric positions. Although there were inversions located in the chromosome arms in Chr05, Chr06, and Chr10, we observed that these inversions had no effect on centromere locations, since the centromeric flanking regions retained synteny. Chr01, Chr02, Chr09, Chr11, Chr12, and Chr13 experienced pericentromeric inversions that is, we observed that the collinearities of flanking regions were reversed between the two genomes, suggesting that inversions spanning the centromere occurred after divergence.

Centromere LTRs have undergone rapid changes

We next examined whether there were any local sequence similarities among the centromeres from non-homologous chromosomes. We used the NCBI blastn tool to align the centromere sequences, and filtered the results with a loose filter (block length larger than 2000 bp with 95% identity). We observed that the centromeric sequences are highly repetitive, and detected more similar sequences from the intraspecies comparison than the interspecies comparison, indicating that centromeres have experienced duplication after speciation (Additional file 2: Fig. S29). Moreover, we found that the sequences from G. davidsonii are more similar, indicating that the duplications occurred later than those from G. thurberi.

The DNA sequences of plant centromeres usually contain many copies of simple tandem repeats, which occur in head-to-tail arrays only those which are associated with CENH3 nucleosomes are considered to be part of the functional centromere [30]. However, our understanding of the role of these sequences in centromere function remains rudimentary at best. Unlike centromere tandem repeats in many plants [31], we found that the tandem repeat content is very low in G. thurberi and G. davidsonii (Fig. 5d). Instead, we observed strong enrichment for LTRs (especially for Gypsy-type retrotransposons), suggesting that cotton centromeres have arisen from retrotransposons.

We used Kimura to analyze LTR insertion times, which revealed that LTRs in centromeres are younger than those at the whole-genome level among all D genomes (D1, D3, D5 and Gh_Dt1) (Fig. 5d). The LTRs in G. davidsonii centromeres are younger than those from G. thurberi (median of 1.336 MYA vs. 1.979 MYA), indicating that centromeres in G. davidsonii have been much more active than those of G. thurberi and supporting that the centromeres in G. davidsonii experienced expansion compared with those from G. thurberi (Fig. 5d). Unlike the nested insertion of full-length LTRs previously reported for Brassica nigra and some cereal centromeric regions [20], we detected full-length LTRs that were independently inserted into the centromeric region, e.g., in Chr04 of G. thurberi, and we identified 16 intact Gypsy-type LTRs that have inserted into centromeres between 1.49 and 9.31 MYA (Fig. 5d).

We constructed a phylogenetic tree of all the LTRs to describe the pattern of diversity (Additional file 2: Fig. S30). Three subclades were mainly found in the centromeric region these were all quite distinct in sequence from the D cotton genome LTRs from non-centromere regions (Additional file 2: Fig. S30a). Moreover, we found that the LTRs from G. davidsonii tend to cluster together in the phylogenetic tree, as did those from G. thurberi, findings which indicate that the LTRs of the centromeres in G. thurberi and G. davidsonii have proliferated and spread after these two species diverged from their common ancestor (Additional file 2: Fig. S30b).

We next identify and characterize the centromeric LTRs by mapping all the intact LTRs in the G. thurberi and G. davidsonii genomes with blastn. One LTR from Chr12 (26,780,294–26,783,754) of G. thurberi had significant BLAST hits for centromeres of each orthologous chromosome in G. thurberi and G. davidsonii (Fig. 5e), and we detected a variety of highly similar sequences throughout the centromeres (this LTR type was designated as “CenLTR”). Further, alignments clearly indicated strong divergence from centromere LTR types (GhCR1-GhCR4) from G. hirsutum (Additional file 1: Table S12).

We further aligned the G. raimondii and Gh_Dt1 genomes and found that the CenLTRs are also enriched in the centromeric their regions, indicating that CenLTRs are apparently widely distributed in the centromeres of D genome species. We compared the sequence identities between the centromeres and the non-centromere sequences for each species. A lot of CenLTR polymorphisms were detected between G. davidsonii centromeres and G. davidsonii non-centromere sequences (Fig. 5e). Similar CenLTR polymorphisms were evident between Gh_Dt1 centromeres and non-centromere sequences (Fig. 5e). Surprisingly, the identity with consensus sequence was lower in the centromeric regions compared with non-centromeric regions (Fig. 5e), indicating that the LTRs have undergone rapid changes in the centromeres.

Divergent evolution of genes involved in stress tolerance

As the D subgenome donor of the widely cultivated upland cotton, G. raimondii is known to have contributed stress tolerance traits to allotetraploid cotton [32]. Nevertheless, allotetraploid cotton is sensitive to Verticillium dahliae infection and to growth in high salinity soils these represent major challenges facing cotton production worldwide, and a lack genetic resources for improving plant tolerance to these challenges is a major constraint in current cotton breeding programs. Here, we found that G. thurberi seedlings are more tolerant to Verticillium dahliae than G. raimondii, indicating that G. thurberi is a promising resource for upland cotton improvement (Fig. 6a). We identified 3472 and 5042 genes associated to tolerance to Verticillium dahliae in G. thurberi and G. raimondii, respectively. We identified a total of 106 genes including NB-LRR, NPR1/3/4, TGA, and downstream transcriptional factors (e.g., WRKY33, SARD1, and CPB60g) potentially involved in disease responses based on their differential responses to the Verticillium dahliae treatments between G. thurberi and G. davidsonii (Fig. 6b). The SA biosynthesis signal pathway was activated in G. thurberi, as the PAD4, EDS1, SAMT, and SBPB2 genes were upregulated in G. thurberi upon Verticillium dahliae challenge (Fig. 6c). We overexpressed WRKY33 (Gthurberi12G176500) genes in G. hirsutum to test whether the genes from wild cotton can be used in cultivated cotton improvement. As expected, the overexpression lines displayed improved upland cotton tolerance to Verticillium dahliae, indicating that G. thurberi can be understood as an important genetic resource for cotton breeding (Additional file 2: Fig. S31).

Models depicting the molecular basis of Verticillium wilt and salt stress tolerance in G. thurberi and G. davidsonii. a Phenotypic comparison of G. thurberi (D1) and G. raimondii (D5) seedlings (35-day-old seedlings) in response to challenge with Verticillium dahliae. Photographs were taken under normal conditions or 14 days after challenge with Verticillium dahliae. b Heat maps for differentially expressed genes with annotations related to salicylic acid (SA) signaling, NB-LRR, and WRKYs. Genes with an adjusted P value < 0.05 and an absolute value of log2[foldchange] > 1 found by EdgeR were designated as differentially expressed. c A proposed model showing that the SA signaling pathways enhance Verticillium wilt tolerance in G. thurberi. V. dahliae attack induces SA biosynthesis via the isochorismate synthase (ICS) and phenylalanine ammonia-lyase (PAL) pathways in plastids. Enhanced disease susceptibility (EDS1) and phytoalexin deficient 4 (PAD4) are required for increased SA accumulation. SA methyltransferase (SAMT) catalyzes SA to MeSA, which diffuses into the cytoplasm, where it is converted back to active SA by SABP2. The red and blue digits in brackets represent the upregulated genes in D1 and D5, respectively. d Phenotypic comparison of G. davidsonii (D3) and G. thurberi (D1) seedlings in response to salt stress treatment (250 mM NaCl watering 21-day-old seedlings every 2 days). Photographs were taken under normal conditions or 14 days after treatment with NaCl solution. e Heat maps for differentially expressed genes with annotations related to ABA, ethylene, and CBL-CIPK pathways. Genes with an adjusted P value < 0.05 and an absolute value of log2[foldchange] > 1 found by EdgeR were designated as differentially expressed. f Transcriptional network related to salt response in G. raimondii and G. davidsonii. Ethylene biosynthesis, calcium signaling, and vacuole NHX are activated in G. davidsonii. The NCED3 gene encodes the enzyme which catalyzes the first step of ABA biosynthesis. The ABA signaling pathway, comprising PYR/PYL/RCAR, PP2C, and SnRKs proteins, is a major plant hormone involved in salt stress responses. Ethylene biosynthesis is catalyzed by the SAM, ACS (ACC synthase), and ACO (ACC oxidase) enzymes. The ethylene signaling pathway includes ethylene receptor, CTR1, and EIN2. TPK (two-pore potassium) is K+ channel that trafficks K+ out of the vacuole. NHX1 (tonoplast-based Na+/H+ exchanger) is required for sequestration of excessive Na + and Cl − in the vacuole. The red and blue digits in the brackets represent the upregulated genes in D3 and D5, respectively

Unlike G. thurberi, G. davidsonii displayed significant salt tolerance in seedlings when compared with G. raimondii (Fig. 6d). A total of 14 ethylene-related genes (including SAM, ACS, ACO, EIN4, CTR, and EIN3) showed differential responses to salt treatment between G. davidsonii and G. raimondii (Fig. 6e). Genes of the CBL-CIPK pathway showed differential responses to salt between G. davidsonii and G. raimondii, with the CIPK and NHX genes being upregulated by salt treatment of G. davidsonii (Fig. 6e). Moreover, we found that other well-known stress-related genes including ERFs, GRASs WRKY, NACs, and MYBs were upregulated in G. davidsonii upon salt treatment (Fig. 6f) such genes have likely played important roles in species divergence and have likely contributed to the spread of the cotton D genome sister species in their adaptation to new ecological contexts and environments.


There are no gold standards for genome assembly and annotation. However, the availability of NGS data (particularly TGS data) and their analytical tools has enabled the sequencing of several high-quality genomes of species of importance in aquaculture in recent years. Beginners and small research groups still face challenges, because genome assembly and annotation are usually complex analytical procedures (or pipelines) requiring interdisciplinary collaborations (from biology to computer science) and hefty costs for refining/maintaining the genome. The recommendations addressed here are broad guidelines that could be considered to avoid common pitfalls throughout the whole-genome assembly and annotation process. However, the comprehensive features (e.g., advantages and disadvantages) of each step and/or technology have not been extensively discussed.

Finally, newly emerging technologies and analytical tools could dramatically improve end-to-end genome assemblies and annotations in the future by replacing the years-long efforts of the past with rapid and low-cost solutions. Meanwhile, emphasis should be placed upon the following: First, define the achievable research aim. Second, avoid the trap of trying to secure a perfect/complete genome assembly and annotation, which could lead to a never-ending project. Third, perform assembly and annotation to gain firsthand experience, including in bioinformatics. Fourth, seek internal and external help and advice from experts. Lastly, be open to sharing genomic data to both increase research productivity and promote public awareness.

3.� Bruijn Graphs: Standard, Multisized, and Paired

3.1. Terminology

Since various assembly articles use widely different terminology, below we specify a terminology that is well suited for PDBGs. All graphs considered below are directed graphs. A vertex w precedes (follows) vertex v in a graph G if there exists an edge from w to v (from v to w) in G. indegree (v) ( outdegree (v)) is the the number of vertices preceding (following) v. A vertex v in a graph G is called a hub if indegree (v) 1 or outdegree (v) 1. A directed path in G is called a hub-path (abbreviated h-path) if its starting and ending vertices are hubs and its intermediate vertices are not hubs. Obviously, each edge in the graph belongs to a unique h-path. An edge is called a hub-edge (abbreviated h-edge) if it starts at a hub. There is a correspondence between h-paths and h-edges: the first edge on each h-path is an h-edge, and the h-edge is unique to that h-path. 6 Given an h-edge α, we define the h-path starting at α as path (α) and denote the number of edges in this h-path (h-path length) as | path (α)|. If a is the i-th edge in an h-path (1 ≤ i ≤ | path (α)|) starting from an h-edge α, we define h - edge (a) = α and offset (a) = i ( Fig. 1 ).

Notation for decomposing a de Bruijn graph into non-branching paths (h-paths). A de Bruijn graph on reads ACCGTCAGAAT and ACCGTGAGAAT with edge size k =𠂔, vertex size k −𠂑 =𠂓. Hubs are shown as solid vertices, while vertices with indegree 1, outdegree 1 are hollow. An h-path CGT → GTG → TGA → GAG →𠂚GA (shown in red with h-edge denoted α) defines an h-read CGTGAGA. The whole path is denoted path (α), and consists of | path (α)| =𠂔 edges. The edges on this path have offsets 1, 2, 3, 4, as indicated. Each edge can be addressed by its path's h-edge and its offset.

3.2. Standard de Bruijn graphs

An n-mer is a string of length n. Given an n-mer , we define prefix and suffix .

For the rest of the paper, we fix a positive integer k. For a set R eads of strings (thought of as the DNA sequencing reads over the alphabet <A, C, G, T>), let N be the number of k-mers that occur in strings in R eads as substrings. We define the de Bruijn graph DB(R eads , k) as follows ( Fig. 2 ):

D1. Define an initial graph G0 on 2N vertices. For each k-mer a that occurs in strings in R eads as a substring, introduce two new vertices u, v and form an edge u → v. Label the new edge by a, u by prefix (a), and v by suffix (a). Note that we label edges by k-mers and vertices by (k −𠂑)-mers.

D2. Glue vertices of G0 together if they have the same label.

Standard and multisized de Bruijn graph. A circular G enome CATCAGATAGGA is covered by a set R eads consisting of nine 4-mers, . Three out of 12 possible 4-mers from G enome are missing from R eads (namely ), but all 3-mers from G enome are present in R eads . (A) The outside circle shows a separate black edge for each 3-mer from R eads . Dotted red lines indicate vertices that will be glued. The inner circle shows the result of applying some of the glues. (B) The graph DB(R eads , 3) resulting from all the glues is tangled. The three h-paths of length 2 in this graph (shown in blue) correspond to h-reads ATAG, AGGA, and GACA. Thus R eads 3,4 contains all 4-mers from G enome . (C) The outside circle shows a separate edge for each of the nine 4-mer reads. The next inner circle shows the graph DB(R eads , 4), and the innermost circle represents the G enome . The graph DB(R eads , 4) is fragmented into 3 connected components. (D) The multisized de Bruijn graph DB(R eads , 3, 4).

In the de Bruijn graph DB(R eads , k), an h-path passing through n vertices defines an (n + k −𠂒)-mer called a hub-read (abbreviated h-read) ( Fig. 1 ). Substituting every h-path in DB(R eads , k) by a single edge labeled by its h-read results in a condensed de Bruijn graph.

We define R eads k as the set of all h-reads in DB(R eads , k). Obviously, DB(R eads , k) =�(R eads k, k).

We define the coverage of an edge in the de Bruijn graph DB(R eads , k) as the number of reads that contain the corresponding k-mer. The coverage of a path is defined as the average coverage of its edges.

3.3. Multisized de Bruijn graphs

The choice of k affects the construction of the de Bruijn graph. Smaller values of k collapse more repeats together, making the graph more tangled. Larger values of k may fail to detect overlaps between reads, particularly in low coverage regions, making the graph more fragmented. Since low coverage regions are typical for SCS data, the choice of k greatly affects the quality of single-cell assembly. Ideally, one should use smaller values of k in low-coverage regions (to reduce fragmentation) and larger values of k in high-coverage regions (to reduce repeat collapsing). The multisized de Bruijn graph (compare with Peng et al. [2010] and Gnerre et al. [2011]) allows us to vary k in this manner.

Given a positive integer δ < k, we define R eads k-δ,k as the union of δ +𠂑 sets: . The multisized de Bruijn graph, DB(R eads , k−δ, k) is defined as DB(R eads k-δ,k, k). Figure 2 shows the standard de Bruijn graphs DB(R eads , 3) (tangled) and DB(R eads , 4) (fragmented) as well as the multisized de Bruijn graph DB(R eads , 3, 4), which is neither tangled nor fragmented.

3.4. Practical paired de Bruijn graphs: k-bimer adjustment

In this presentation, we focus on a library with bireads having insert sizes in the range dinsert ± Δ. The genomic distance between two positions in a circular genome (and k-mers starting at these positions) is the difference of their coordinates modulo the genome length. For example, the genomic distance between a pair of reads of length ℓ oriented in the same direction with insert size dinsert is dinsert − ℓ. In the ensuing discussion, we will work with genomic distances between k-mers rather than insert sizes.

Medvedev et al. (2011a) introduced paired de Bruijn graphs (PDBG), a new approach to assembling bireads similar approaches were recently proposed in Donmez and Brudno (2011) and Chikhi and Lavenier (2011). While PDBGs have advantages over standard de Bruijn graphs when the exact insert size is fixed, Medvedev et al. (2011a) acknowledged that PDBG-based assemblies deteriorate in practice for bireads with variable insert sizes and raised the problem of making PDBGs practical for insert size variations characteristic of current sequencing technologies.

Since it is still impractical to experimentally generate bireads with exact (or nearly exact) distances, we describe a computational approach that addresses the same goal. A k-bimer is a triple (a|b, d) consisting of k-mers a and b together with an integer d (estimated distance between particular instances of a and b in a genome). SPAdes first extracts k-bimers from bireads, resulting in k-bimers with inexact distance estimates (inherited from biread distance estimates). The k-bimer adjustment approach transforms this set of k-bimers (with rather inaccurate distance estimates) into a set of adjusted k-bimers with exact or nearly exact distance estimates. Similarly to error correction, which replaces original reads with virtual error-corrected reads (Pevzner et al., 2001), k-bimer adjustment substitutes original k-bimers by adjusted k-bimers. The adjusted k-bimer can be formed by two k-mers that were not parts of a single biread in the input data. We show that k-bimer adjustment improves accuracy of distance estimates and the resulting assembly.


During the last decade, developments in DNA sequencing technology have led to a surge in the number of eukaryotic genomes being published. The bulk of these genomes belong to animals, plants, and fungi, while single-celled eukaryotes (protists) remain largely absent 1 . This is unfortunate as protists have an enormous diversity of cellular morphologies, physiology, and genetics, possibly even more so than their multicellular relatives 2 . Although there has been a recent increase in the number of available protist genomes e.g. 3,4,5 , some groups are still completely devoid of any genomic information 6 .

One protist group, of which we have no genomic information is the green algal order Dasycladales whose species have a very characteristic cellular morphology. Despite being unicellular, and having only a single nucleus, some species can grow to more than 10 cm in length 7 . Acetabularia acetabulum is the most studied species of Dasycladales. This umbrella-looking organism is elongated in an apical-basal direction with the root-like rhizoid in the basal end and a disc-shaped cap in the apical end, separated by a long stalk 7,8 .

The size and highly elaborate cellular morphology, together with a large and distinct nucleus, made Acetabularia an attractive model system for studies of cell biology and genetics. Already in the 1930s, Joachim Hämmerling used Acetabularia to prove that cellular morphogenesis was influenced by so-called “morphogenetic substances” (later confirmed to be RNA) which were produced by the nucleus and distributed to the rest of the cell 9 . By transplanting and exchanging the apical and basal parts between A. acetabulum and A. crenulata, he observed that the cap developed into the morphology of the basal donor, demonstrating that the nucleus-containing rhizoid was in control of the morphogenetic fate of the cell 10,11 .

Despite its popularity and importance in early cell biology and genetics, the interest in A. acetabulum and its sister species dropped towards the end of the 1990s. As of yet, no attempt has been reported to sequence and assemble the genome of A. acetabulum, or any other dasycladalean species. The lack of Dasycladalean genomes, and protist genomes in general, can to a large extent be explained by the challenge of obtaining sufficient levels of genomic DNA required for sequencing. The A. acetabulum cell has a life cycle of 3 months when cultivated in a highly nutritious media 12 and cultures cannot be grown densely (maximum 25 algae in 50 ml) 11 . Typical library preparation protocols for whole genome sequencing depend on several hundred nanograms of input DNA, which equates to thousands of A. acetabulum individuals for a single sequencing sample. Considering the potentially enormous size of the A. acetabulum genome, with the diploid nuclear genome estimated to be 1.85 pg (ca. 1.8 Gb) based on flow-cytometry measurements 13 , this further increases the demand for DNA input.

In order to solve the challenges of limited DNA material, several methods for amplification of genomic DNA have been developed. The earliest whole genome amplification (WGA) methods were based on short-length PCR amplifications with random or degenerate primers 14,15 . These methods often recovered only small fractions of the genomes and were hugely influenced by biases introduced by PCR amplification 16,17 . The most promising development in WGA has been the use of multiple displacement amplification (MDA). This method utilizes the phi29 polymerase which copies DNA with high fidelity and incorporates more than 70,000 nucleotides without falling off the template, resulting in large stretches of amplified DNA 16,18 . However, there are several challenges associated with the phi29-based MDA method. First, as the MDA method relies on random priming, the priming and amplification do not distinguish between target and possible contaminant DNA in the sample. De novo assemblies can therefore be challenging if databases lack target genomes or contaminant sequences 19 . Second, and again like PCR-based amplification methods, MDA is also prone to amplification bias. Observations made on bacterial genomes amplified by MDA have shown that certain genomic regions seem to be more readily amplified than others, creating highly uneven coverage across the genome 19,20,21,22 . MDA-generated data therefore rarely results in full genome recovery. López-Escardó et al. 23 used MDA to amplify the genome of three cells of the protist Monosiga brevicollis and showed highly uneven coverage and a genome recovery of 6–36% from each cell when mapping to a reference assembly, and Mangot et al. 24 recovered about 20% of the genome when assembling cells of the protist group MAST-4. However, both studies highlighted the importance of amplifying the DNA from several cells, as this greatly increased recovery 23,24 .

A promising method to reduce bias associated with MDA, and thereby increase genomic coverage, is to divide the amplification reaction into nano-sized droplets, a method referred to as droplet MDA (dMDA) 25,26,27 . The idea behind dMDA is to isolate the target DNA fragments into tiny droplets and thereby reducing the competition of encountering a polymerase, leading to a more uniform amplification and overall improved genome coverage. Marcy et al. 25 tested the effect of droplet MDA by detection of 10 gene loci from 14 dMDA and 12 standard MDA reactions of E. coli cells and found that all 10 loci were found in all 14 dMDA samples, but that several loci were missing from multiple standard MDA samples. In addition, samples generated with dMDA displayed a much more uniform amplification (measured by copies of loci/ul) than the samples generated by standard MDA. Likewise, the genome recovery from sequencing E. coli cells was increased from 59% with standard MDA to 89% using dMDA 28 .

The goal of the present study was to genome sequence and de novo assemble the genome of A. acetabulum. To obtain sufficient genomic DNA for sequencing we have amplified DNA from single embryonic cells using dMDA. We present an assessment of the sequencing data produced by single-cell dMDA, and its usefulness for assembly of large eukaryotic genomes. In addition, we compare three different assembly strategies assembling each single-cell dMDA library separately, assembling these individual assemblies into a meta-assembly, and assembling all the sequencing libraries combined (co-assembly). This study is among the first to use single-cell dMDA for sequencing and de novo assembly of a eukaryote genome and should serve as a useful reference for future attempts to sequence species that are difficult to cultivate or collected from the environment.


The red-spotted grouper Epinephelus akaara (E. akaara) is one of the most economically important marine fish in China, Japan and South-East Asia and is a threatened species. The species is also considered a good model for studies of sex inversion, development, genetic diversity and immunity. Despite its importance, molecular resources for E. akaara remain limited and no reference genome has been published to date. In this study, we constructed a chromosome-level reference genome of E. akaara by taking advantage of long-read single-molecule sequencing and de novo assembly by Oxford Nanopore Technology (ONT) and Hi-C. A red-spotted grouper genome of 1.135 Gb was assembled from a total of 106.29 Gb polished Nanopore sequence (GridION, ONT), equivalent to 96-fold genome coverage. The assembled genome represents 96.8% completeness (BUSCO) with a contig N50 length of 5.25 Mb and a longest contig of 25.75 Mb. The contigs were clustered and ordered onto 24 pseudochromosomes covering approximately 95.55% of the genome assembly with Hi-C data, with a scaffold N50 length of 46.03 Mb. The genome contained 43.02% repeat sequences and 5,480 noncoding RNAs. Furthermore, combined with several RNA-seq data sets, 23,808 (99.5%) genes were functionally annotated from a total of 23,923 predicted protein-coding sequences. The high-quality chromosome-level reference genome of E. akaara was assembled for the first time and will be a valuable resource for molecular breeding and functional genomics studies of red-spotted grouper in the future.


Conservation applications

We have summarized information on current methods for whole-genome sequencing, assembly and annotation, with the aim of providing practical guidance for conservation or ecology-oriented research groups moving into the field of genomics. The focus has been on large and complex genomes of nonmodel organisms relevant from a conservation perspective. In the introduction, we outlined a number different ways in which genomic resources in general, and a complete genome sequence, in particular, can be applied in a conservation biology setting (see also Fig. 1). Conservation genomics being a young field, examples where genomic resources have been put to the test in an applied conservation context are still limited, but a few such cases may be worth highlighting.

One of the first nonmodel genomes to be sequenced using the Illumina technology was the giant panda (Li et al. 2010 ). While the focus of the panda genome paper was not on conservation issues, follow-up studies have utilized the draft genome to make inferences about population structure, adaptive genetic variation and demography (Wei et al. 2012 ). Likewise in the Aye-Aye, resequencing data from twelve individuals from different parts of Madagascar were utilized to infer fine-scale genetic population structure and conduct landscape genetic analyses. The results were used to provide guidance for allocation of conservation resources towards preserving large and contiguous habitats in northern Madagascar (Perry et al. 2013 ). Genomic resources have further been utilized in breeding programs of the Tasmanian devil, which is endangered in the wild due to a contagious facial cancer. The generation of a reference genome sequence in combination with genomewide resequencing data has made it possible to investigate many details of this disease, including the identification of candidate genes involved in tumorigenesis (Murchison et al. 2012 ). Similarly, genomic resources have been utilized to limit the spread of a developmental disease causing mutation in breeding programs of the California condor (Romanov et al. 2009 ). Finally, genomewide SNP screening has been effective in several studies of fishery stock monitoring and management (Primmer 2009 Nielsen et al. 2012a ).

Future directions

With rapid progress in sequencing nano-technology and further development of computational methods, we can expect that all steps of the workflow will continue to be improved. New library preparation protocols will enable sequencing from less starting material, producing libraries with longer and more precisely estimated insert sizes and generating longer reads with reduced error rates. The development of more efficient assembly algorithms and increasing computational power will make the bioinformatic data processing amenable to a larger spectrum of users. As the costs involved in genome sequencing and assembly continues to drop, the generation of a draft genome sequence will soon become routine, also for species with large genomes. This development will mean that even small research groups with limited funding will soon be expected to develop genomic resources for their species of choice, reinforcing the use of genomic approaches in conservation biology and related disciplines. The possible development of rapid and compact sequencing solutions that may be applied directly in the field situation would be particularly useful for many conservation applications. Another important area of progress lies in the usage of low-quality samples, obtained from noninvasive sampling or museum material that would allow monitoring of genomic diversity through time. Developing ways of storing and sharing genomic data will also be crucial, to make the most efficient use of these resources for conservation. In spite of these promising developments, we need to be aware that science alone is not sufficient to meet future conservation challenges. The technical transition from conservation genetics to genome-scale data therefore needs to be tightly accompanied by a discussion of how applied conservation biology can best benefit from genomic data (see for example McMahon et al., 2014 ). This discussion needs to be taken at the general level on a case-by-case basis and involves scientists and political decision makers alike.

Watch the video: Human Genome Project Sequencing: Private Project (September 2022).


  1. Shakalmaran

    Between us speaking, I have tried to decide this problem.

  2. Kenjiro

    And what in this case?

  3. Weatherly

    In my opinion, they are wrong. Let us try to discuss this. Write to me in PM, speak.

  4. Rypan

    I think you will allow the mistake. Write to me in PM.

  5. Delsin

    I really wanted to talk to you.

Write a message