Gene flow analysis with a single mitochondrial marker

I am currently studying a mud snail genus called Ecrobia. Several species of this genus occur in both the Mediterranean and Black seas. Therefore, I would like to investigate whether there is:

  1. Intraspecific gene flow between the Mediterranean and Black Sea populations
  2. Interspecific gene flow between the Ecrobia spp. occurring in the same water body

Therefore, I have a couple of questions:

  • Is it possible to perform a gene flow analysis with a single mitochondrial marker (COI)? If so, which program could I use to run the analysis?
  • Ideally, how many individuals per population are required? If I don't have enough individuals, is there a rigorous way of clustering populations?

Mitochondrial DNA markers reveal genetic connectivity among populations of Osteoglossiform fish Chitala chitala

Genetic diversity and population structure in the Indian featherback fish, Chitala chitala (Hamilton, 1822), were investigated by combined analyses of two full mitochondrial genes, ATPase 6/8 and Cytochrome b. A total of 403 individuals, collected from 14 rivers, yielded 61 haplotypes. Hierarchical partitioning analysis identified 19.01% variance ‘among’ and 80.99% variance ‘within groups and populations’. The mean coefficient of genetic differentiation (FST) was significant at 0.26 (p < 0.05). Mantel tests rejected the hypothesis that genetic and geographic distances were correlated. The patterns of genetic differentiation, AMOVA and principal coordinate analyses indicated that natural populations were sub-structured and comprised four genetic stocks of C. chitala in Indian rivers. The results also supported the higher resolution potential of concatenated gene sequences. The knowledge of genetic variation and divergence from this study can be used for the scientific conservation and management of the species in the wild.

Tracking history with the Y-chromosome

William Sanchez of Albuquerque, New Mexico, was fascinated with genealogy and so had his DNA tested in 2001. Certain customs in his family made him ponder his ancestors who had been colonists in Mexico nearly a half millennium before his birth. He knew they had come from Spain, but there were things that his family did that anthropologists would call ‘crypto-Jewish’. They lit candles on Friday nights, covered mirrors when people died, and did several other things that as a boy Sanchez had assumed were normal for a Spanish-speaking Roman Catholic family in the Southwestern United States.

At age 52, though, he knew better, so when he got a call from a scientist at the genetic testing company, he was pretty sure they were going to tell him that he had some Jewish ancestry. Spain, after all, had expelled its Jewish population in 1492, except for those agreeing to convert to Christianity. Some had converted but had continued their Jewish practices in secret, and many had fled Spain for the Americas. It therefore seemed plausible that some genetic sequences common in Jews might be part of his DNA.

Sanchez was puzzled when the scientist on the phone told him that the testing showed he had genetic sequences of priests. Poetically, it was fitting: Sanchez actually was a priest, a Catholic priest, but it didn’t make sense.

"I mean that you’re a Kohane," the scientist explained. "You share ancestry with the Jewish priesthood." The company was telling Sanchez that he had a genetic marker that was patrilineal – passed down from father to son, because it was a Y-chromosome haplotype – and identifiable because it was carried by people with a particular historical role in Jewish culture.

Y-chromosome haplotypes

Haplotypes are genetic sequences that we inherit from only one parent. This is different from autosomal genes, which are genes on a numbered chromosome and usually affect males and females in similar ways. Although a small portion of the Y-chromosome has homology, or similarity of genes at the same place, with a region of the X-chromosome, where crossing over can occur between a dozen or so genes, the rest of the Y-chromosome is unique and non-homologous. It is passed down purely from father to son with no recombination. In this way, it is like a surname in many cultures, or for that matter like the Jewish priesthood: passed down from father to son.

While Y-chromosome haplotypes are inherited as single genetic units without shuffling, they are subject to random mutation, just like the rest of the genome. Harmful mutations are weeded out of the population through natural selection, but many mutations can stick around. For example, natural selection does not eliminate mutations that occur in an intron (a non-coding region of a gene) or in non-coding regions between genes.

The DNA polymerase enzyme is extremely efficient, so random mutation is rare. On average, it happens only once per hundred million base pairs of DNA per generation. Knowing this rate, we can use mutations in haplotypes as a kind of molecular clock. Based on the amount of variation in haplotypes in different people, scientists can produce family trees estimating the common ancestry of those people.
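As a rough illustration of the molecular-clock idea, the sketch below estimates how many generations ago two non-recombining haplotypes shared an ancestor, given the number of differences between them. The rate of 1e-8 per base pair per generation is the figure quoted above; the function name, the example numbers, and the 25-year generation time are assumptions made only for illustration.

```python
def generations_to_common_ancestor(differences, sites_compared, mu=1e-8):
    """Estimate generations since two non-recombining haplotypes diverged.

    Mutations accumulate independently along both lineages, so the expected
    number of differences is 2 * mu * sites_compared * generations.
    """
    if sites_compared <= 0 or mu <= 0:
        raise ValueError("sites_compared and mu must be positive")
    return differences / (2 * mu * sites_compared)


# Hypothetical example: 10 differences across 1,000,000 compared base pairs.
gens = generations_to_common_ancestor(10, 1_000_000)
years = gens * 25  # assumed generation time of 25 years
```

Real dating studies use calibrated rates and statistical models of uncertainty; this only shows the arithmetic behind the clock.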

Y-chromosome family tree

In the mid-1990s, the discovery of numerous Y-chromosome haplotypes enabled scientists to start building a Y-chromosome family tree, consisting of various branches all stemming back to a common trunk representing the common ancestor of all Y-chromosome haplotypes (Figure 2). This common ancestor is called Y-MRCA, for "most recent common ancestor." This is sometimes called the Y-chromosome Adam, and it tracks back to 200,000 – 300,000 years ago.

Figure 2: This family tree traces various branches all stemming back to a common ancestor of all Y-chromosome haplotypes. This common ancestor is called Y-MRCA, for "most recent common ancestor," or Y-chromosome Adam.

It is important to note that, while all living men have a Y-chromosome that is descended from a single male living in the Pleistocene epoch, there were many other men living at that time and many of them also left many descendants. The Y-chromosome Adam is just the man that all living men have in common through our patrilineal ancestry. Plenty of Adam’s contemporaries also left many descendants, just not through a purely patrilineal line.

In 1997, researchers in Haifa, Israel, and Toronto, Canada, applied the same principles to test the claim that the Jewish priesthood began with a single man who lived more than 3,000 years ago. Among DNA from Jewish men from many different countries, the researchers identified two sequences, or markers, that were much more common in Jewish priests (Kohanim; singular Kohane) compared with Jews in general. A year later, the team identified four additional markers common in Kohanim, and they designated the six markers as the J1 Cohen Modal haplotype.

The Cohen Modal haplotype is found in both Ashkenazi and Sephardi Jews, the two largest Jewish populations. It also exists among non-Jews whose ancestry is in the Middle East, but there are great differences in frequencies of the six different markers that comprise the haplotype. Notably, the two markers that the team identified initially (called YAP and DYS19) were found to be present in 55 percent of Kohanim. The rate differed between the Ashkenazi and Sephardi populations: 58 percent of Sephardi Kohanim showed the markers versus 48 percent of their Ashkenazi counterparts. But the rates in non-Kohanic Jews and in non-Jews were found to be much lower, meaning that Kohanim represented a distinct population, stemming from a common Y-chromosomal ancestor.

By comparing the Kohanic markers of the J1 Cohen haplotype to other Y-chromosome haplotypes, researchers were able to place Jewish priests on the Y-chromosome family tree. This allowed them to date the common ancestor for the Kohanic priests of both Ashkenazi and Sephardi Jewish populations to 2,400 to 3,000 years ago (Figure 3).

Figure 3: A haplotype tree diagram of the Kohanim population's Y-chromosomal family tree. image © Chriscohen

Sanchez was told he was a priest because he had the Cohen Modal haplotype, but perhaps more surprising is that the same haplotype was also found among the priests of the Lemba tribe in Zimbabwe, South Africa, Mozambique, and Malawi. The Lemba had always claimed to be part of the Jewish People, but anthropologists had been dismissing the idea for over a century. When molecular geneticists tested the Lemba priestly class called the Buba in the early 2000s, not only did they harbor the Cohen Modal haplotype, they carried it at a rate of 65 percent. They were Kohanim, with a patrilineal line even more pure than that of the Kohanim of Ashkenazi and Sephardi Jews. That these African Jews more closely resemble their fellow Africans than they do their fellow Jews of Europe and the Middle East tells of the extensive intermarriage of the Jewish diaspora with local populations. Despite this, the Jewish blood lines were maintained as revealed by their Y-chromosomes.

Materials & Methods


Thirteen F. graminearum strains were sequenced individually on the Illumina MiSeq platform (Table 1). In addition, F. graminearum strain PH-1 (CBS 123657, NRRL 31084) was sequenced on the Illumina HiSeq platform both as a single strain and as part of a pooled set of five F. graminearum strains (Table 1). Besides the newly sequenced strains, whole genome sequencing reads of ten F. graminearum isolates produced by other research groups were downloaded from the SRA database of NCBI (Laurent et al., 2017; Wang et al., 2017). The outgroup F. gerlachii strain was sequenced for an earlier publication (Kulik et al., 2016). A detailed description of the fungal strains is given in Table 1.

Species | Strain | Origin | Host | Year of isolation | Sequenced individually or in a pool
F. graminearum | CBS123657 (PH-1) NRRL31084 | USA | maize | 1996 | both
F. graminearum | CBS119173 | USA | wheat head | 2005 | individually
F. graminearum | CBS139513 | Argentina | barley | 2011 | individually
F. graminearum | CBS139514 | Argentina | barley | 2010 | individually
F. graminearum | CBS119799 | South Africa | wheat kernel | 1987 | individually
F. graminearum | CBS119800 | South Africa | maize | 1990 | individually
F. graminearum | CBS110263 | Iran | maize | 1968 | individually
F. graminearum | CBS123688 | Sweden | oats | unknown | individually
F. graminearum | CBS128539 | Belgium | wheat kernel | 2007 | individually
F. graminearum | CBS138561 | Poland | wheat kernel | 2010 | individually
F. graminearum | CBS138562 | Poland | wheat kernel | 2010 | individually
F. graminearum | CBS138563 | Poland | wheat kernel | 2003 | individually
F. graminearum | CBS104.09 | unknown | unknown | 1909 | individually
F. graminearum | CBS185.32 | unknown | maize | 1932 | individually
F. graminearum | CS3005 | Australia | barley | 2001 | individually
F. graminearum | HN9-1 | China | wheat | 2002 | individually
F. graminearum | HN-Z6 | China | wheat | 2012 | individually
F. graminearum | INRA-156 | France | wheat | 2001 | individually
F. graminearum | INRA-159 | France | wheat | 2001 | individually
F. graminearum | INRA-164 | France | wheat | 2002 | individually
F. graminearum | INRA-171 | France | wheat | 2001 | individually
F. graminearum | INRA-181 | France | wheat | 2002 | individually
F. graminearum | INRA-195 | France | wheat | 2002 | individually
F. graminearum | YL-1 | China | wheat | 2012 | individually
F. graminearum | bfb0999_1 | China | barley | 2005 | pooled
F. graminearum | 68D2 | Netherlands | wheat | 2001 | pooled
F. graminearum | CHG013 | China | maize | 2005 | pooled
F. graminearum | CHG157 | China | barley | 2005 | pooled
F. gerlachii | CBS123666 | USA | wheat head | 2000 | individually


Illumina MiSeq

Whole genome libraries were prepared using the Nextera XT kit (Illumina, San Diego, CA, USA) from gDNA extracted from mycelium. The constructed libraries were sequenced on the Illumina MiSeq platform with 250 bp paired-end reads, version 2. The fungal genomes were sequenced in a multiplexed format (6–7 samples per run), where an oligonucleotide index barcode was embedded within adapter sequences that were ligated to DNA fragments (Smith et al., 2010). Next, the sequence reads were de-multiplexed and filtered for low-quality base calls by trimming all bases with Phred scores <Q30 from the 5′ and 3′ read ends.

Illumina HiSeq

For F. graminearum strain PH-1 (CBS 123657, NRRL 31084) a random sheared shotgun library was prepared using the NEXTflex ChIP-seq Library prep kit with adaptations for low input gDNA according to the manufacturer’s protocol (Bioscientific). The library was loaded as (part of) one lane of an Illumina paired-end flowcell for cluster generation using a cBot. Sequencing was done on an Illumina HiSeq2000 instrument using 101, 7, 101 flow cycles for forward, index and reverse reads respectively. De-multiplexing of resulting data was carried out using the Casava 1.8 software. Sequencing reads have been uploaded to the European Nucleotide Archive (ENA) with the accession number PRJEB18592.

The same method was applied for the pooled sequencing, with the adjustment that the random sheared shotgun library was prepared using equal amounts of genomic DNA extract from all five strains (Table 1). Sequencing reads have been uploaded to the European Nucleotide Archive (ENA) with the accession number PRJEB18596.

Third party sequencing data

Besides the sequencing data that we generated, we also used sequencing data produced by other research groups that had been submitted to the SRA (Sequence Read Archive) database. This included SRA data for six strains isolated from France (PRJNA295638; Laurent et al., 2017), three strains from China (PRJNA296400; Wang et al., 2017) and one strain from Australia (PRJNA235346; Gardiner, Stiller & Kazan, 2014). The mitochondrial genome sequences for the strains sequenced by third parties are available in the Third Party Annotation Section of the DDBJ/ENA/GenBank databases under the accession numbers TPA: BK010538–BK010547.


GRAbB was used with the SPAdes assembler to reconstruct the mitogenomes of the strains. GRAbB (Brankovics et al., 2016) was chosen because it is a wrapper program for iterative de novo assembly based on a reference sequence. The SPAdes 3.8.1 assembler (Bankevich et al., 2012; Nurk et al., 2013) was used because it gives the user good insight into the relationships between nodes in the assembly graph and between nodes, contigs and scaffolds. The mitochondrial genomes were assembled from NGS reads using GRAbB by specifying the mitogenome sequence of the PH-1 strain (HG970331) as the query sequence.

For each individually sequenced strain it was possible to resolve the assembly to a single circular sequence. When the GRAbB run finished for the strains that were pooled for sequencing, the final assembly graph was visualized using Bandage (Wick et al., 2015) and the assembly was resolved to two circular sequence variants to capture all the variation within the dataset (Text S1). For the first circular sequence, referred to as “short”, the shorter alternative contigs were included in the path at each position where continuity was ambiguous. For the other sequence, referred to as “long”, the longer alternatives were included. In this way, all the different sequence regions were represented at least once in the two sequences.

Sequence annotation

The initial mitogenome annotations were done using MFannot and were manually improved: annotation of tRNA genes was improved using tRNAscan-SE (Pavesi et al., 1994), and annotation of protein-coding genes and the rnl gene was corrected by aligning intronless homologs to the genome. Intron-encoded proteins were identified using NCBI’s ORF Finder and annotated using InterPro (Mitchell et al., 2015) and CD-Search (Marchler-Bauer & Bryant, 2004). The annotated mitochondrial genome sequences are available under the following GenBank accession numbers: BK010538–BK010547, KP966550–KP966561, KR011238 and MH412632.

Read mapping and SNP discovery

The mitogenome of F. graminearum strain PH-1 and the two mitogenome sequences obtained from the assembly of the pooled dataset were used as reference sequences for the read mapping and SNP discovery. The read mapping was done using aln and sampe subcommands of the Burrows-Wheeler Alignment tool (BWA-0.7.12-r1034) (Li & Durbin, 2009). SNP calling was done using SAMtools mpileup (1.3.1) with -g and -f flag and BCFtools call (1.3.1) with -mv flag (Li et al., 2009).
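The mapping and SNP-calling steps named above could be chained roughly as in the following sketch, which only builds the command strings and does not run them. The subcommands and the -g, -f and -mv flags are those stated in the text; the `bwa index` and `samtools sort` steps, all file names, and the function name are assumptions added so that the chain is complete, and they are not taken from the original methods.

```python
def mapping_and_snp_commands(ref, reads1, reads2, prefix):
    """Return shell command strings for BWA aln/sampe mapping and SNP calling.

    File names derived from `prefix` are hypothetical placeholders.
    """
    sai1, sai2 = f"{prefix}_1.sai", f"{prefix}_2.sai"
    sam, bam, vcf = f"{prefix}.sam", f"{prefix}.sorted.bam", f"{prefix}.vcf"
    return [
        # Assumed preparatory step: index the reference for BWA.
        f"bwa index {ref}",
        # Align each read file separately, then pair the alignments.
        f"bwa aln {ref} {reads1} > {sai1}",
        f"bwa aln {ref} {reads2} > {sai2}",
        f"bwa sampe {ref} {sai1} {sai2} {reads1} {reads2} > {sam}",
        # Assumed step: mpileup expects coordinate-sorted input.
        f"samtools sort -o {bam} {sam}",
        # Pileup as BCF (-g) against the reference (-f), then call variants.
        f"samtools mpileup -g -f {ref} {bam} | bcftools call -mv -o {vcf}",
    ]


commands = mapping_and_snp_commands("PH-1_mito.fasta", "r1.fastq", "r2.fastq", "sample")
```

Printing `commands` gives a script that can be reviewed before anything is executed, which is useful when the same steps are repeated for each of the three reference sequences.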

Coverage analysis

Coverage of different regions was estimated by, first, mapping reads of the pooled dataset to the reference sequence using the sampe subcommand of the Burrows-Wheeler Alignment tool (BWA-0.7.12-r1034) (Li & Durbin, 2009). Then, read coverage was calculated using the genomecov command of bedtools v2.26.0. The following single copy nuclear protein coding genes were used to represent single copy nuclear regions: γ-actin (act), β-tubulin II (tub2), calmodulin (cal), 60S ribosomal protein L10 (rpl10a), the second largest subunit of DNA-dependent RNA polymerase II (rpb2), translation elongation factor 1α (tef1a), translation elongation factor 3 (tef3) and topoisomerase I (top1). The reference sequences were extracted from the genome of PH-1 (four chromosomes: HG970332, HG970333, HG970334, and HG970335). The nuclear mitochondrial DNA segment (NUMT) used for coverage comparison was identified during the assembly of the pooled data (see Text S1).

Intron validation

The RNA-seq data for F. graminearum PH-1 were downloaded from NCBI’s SRA database, accession number PRJNA239711 (Zhao et al., 2014). Read mapping was done with the HISAT2 aligner (Kim, Langmead & Salzberg, 2015) by specifying putative intron positions. The intron positions were validated based on the splice site output file and by examining the mapping SAM file produced by the aligner.

Linear model

R was used for linear model analysis to test whether intron variation is the main cause of mitochondrial genome length variation within the species. The linear model was y = x + c, where y was the total length of the mitochondrial genome, x was the length of the intron sequences and c was the y-intercept (the average intronless length of the mitochondrial genomes). The R^2 value obtained from the linear model analysis specifies what percentage of the variation in the dependent variable (mitogenome length) is explained by the variation in the independent variable (intron length):

$R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$

The residual sum of squares ($SS_{\text{residual}}$) and the total sum of squares ($SS_{\text{total}}$) were calculated using the deviance function of R.
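The linear model described above can be sketched in a few lines. The original analysis was done in R; the Python below is only an illustrative re-expression, with the slope fixed at 1 as the model y = x + c implies. The function name and the example lengths are hypothetical.

```python
def fixed_slope_r_squared(intron_lengths, genome_lengths):
    """Fit y = x + c (slope fixed at 1) and return (c, R^2)."""
    n = len(genome_lengths)
    if n < 2 or len(intron_lengths) != n:
        raise ValueError("need two equally long lists with at least 2 values")
    # With the slope fixed at 1, least squares gives c = mean(y - x).
    c = sum(y - x for x, y in zip(intron_lengths, genome_lengths)) / n
    mean_y = sum(genome_lengths) / n
    ss_residual = sum(
        (y - (x + c)) ** 2 for x, y in zip(intron_lengths, genome_lengths)
    )
    ss_total = sum((y - mean_y) ** 2 for y in genome_lengths)
    if ss_total == 0:
        raise ValueError("genome lengths do not vary; R^2 is undefined")
    return c, 1 - ss_residual / ss_total


# Hypothetical lengths in base pairs, not data from the study.
introns = [40000, 45000, 50000]
genomes = [90000, 95100, 99900]
c, r_squared = fixed_slope_r_squared(introns, genomes)
```

An R^2 close to 1 in this setting would mean that nearly all of the length variation between mitogenomes is accounted for by intron content.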

Comparative sequence analysis

The nucleotide sequences were aligned using MUSCLE (Edgar, 2004a; Edgar, 2004b). Sequence variability of a given region was calculated from the alignment: the number of characters (alignment columns) with multiple character states was divided by the total number of characters in the alignment. This step was done using fasta_variability from the fasta_tools package.
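The variability measure described above is simple to state in code. The sketch below is not the fasta_variability implementation; it is a minimal stand-in that assumes the sequences are already aligned to equal length and, as a further assumption, treats a gap as one more character state.

```python
def alignment_variability(aligned_seqs):
    """Fraction of alignment columns that contain more than one character state."""
    if not aligned_seqs:
        raise ValueError("no sequences given")
    length = len(aligned_seqs[0])
    if length == 0 or any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same non-zero length")
    # zip(*seqs) walks the alignment column by column.
    variable = sum(1 for column in zip(*aligned_seqs) if len(set(column)) > 1)
    return variable / length


# Hypothetical toy alignment: columns 3 and 5 vary, so variability is 2/6.
toy_alignment = ["ACGT-A", "ACGTTA", "ACCTTA"]
variability = alignment_variability(toy_alignment)
```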

Detecting the presence of recombination

The intergenic regions were analyzed using the Φw-test implemented in SplitsTree (Bruen, Philippe & Bryant, 2006) to detect whether there is recombination in the mitochondrial genome.

Combinatorial prediction of marker panels from single-cell transcriptomic data

Single-cell transcriptomic studies are identifying novel cell populations with exciting functional roles in various in vivo contexts, but identification of succinct gene marker panels for such populations remains a challenge. In this work, we introduce COMET, a computational framework for the identification of candidate marker panels consisting of one or more genes for cell populations of interest identified with single-cell RNA-seq data. We show that COMET outperforms other methods for the identification of single-gene panels and enables, for the first time, prediction of multi-gene marker panels ranked by relevance. Staining by flow cytometry assay confirmed the accuracy of COMET's predictions in identifying marker panels for cellular subtypes, at both the single- and multi-gene levels, validating COMET's applicability and accuracy in predicting favorable marker panels from transcriptomic input. COMET is a general non-parametric statistical framework and can be used as-is on various high-throughput datasets in addition to single-cell RNA-sequencing data. COMET is available for use via a web interface or a stand-alone software package.

Keywords: cell types; computational biology; data analysis; marker panel; single-cell RNA-seq.

© 2019 The Authors. Published under the terms of the CC BY 4.0 license.

Conflict of interest statement

The authors declare that they have no conflict of interest.


Figure 1. The COMET framework objective and output

Following the identification of a cell population…

Figure 2. Attributes and performance of the COMET Algorithm

An illustration of the binarization procedure applied by COMET to each gene in a cluster‐specific manner via the non‐parametric XL‐mHG test (preprint: Wagner, 2015a). For each gene, an expression threshold of maximal classification strength for the given cluster is annotated with the XL‐mHG test. The XL‐mHG P‐value measures the significance of the chosen threshold index. This threshold index is then matched to an expression cutoff which is used to binarize gene expression values.

The assessment and ranking of multi‐gene marker panels by COMET utilize matrix multiplications. Following the binarization of gene expression at the single‐gene level, the (i) true‐positive and (ii) false‐positive rates for all gene combinations considered can be derived from two matrix multiplications (Materials and Methods). Illustrated here is the matrix multiplication to annotate true‐positive rates for 2‐gene marker panels. The true‐positive and false‐positive values are then used to compute hypergeometric enrichment P‐values for all pairs.
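To make the matrix-multiplication step concrete, the sketch below counts, for every gene pair, the cells in which both genes are "on" after binarization, once for the cluster and once for the background, and then computes a hypergeometric enrichment P-value for one pair. It is a plain-Python illustration under assumed toy data, not COMET's implementation; the function names and matrices are hypothetical.

```python
from math import comb


def pair_counts(binary):
    """Compute B^T B for a cells x genes 0/1 matrix.

    Entry [i][j] is the number of cells expressing both gene i and gene j;
    the diagonal holds single-gene counts.
    """
    n_genes = len(binary[0])
    return [
        [sum(cell[i] * cell[j] for cell in binary) for j in range(n_genes)]
        for i in range(n_genes)
    ]


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric draw without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    return sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    ) / total


# Toy binarized expression (rows: cells, columns: 3 genes).
in_cluster = [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
background = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]

tp = pair_counts(in_cluster)   # first multiplication: true positives
fp = pair_counts(background)   # second multiplication: false positives

tp_rate = tp[0][1] / len(in_cluster)   # genes 0 and 1 together
fp_rate = fp[0][1] / len(background)
p_value = hypergeom_sf(
    tp[0][1],
    len(in_cluster) + len(background),
    len(in_cluster),
    tp[0][1] + fp[0][1],
)
```

With real data the same products would normally be computed with an optimized linear algebra library, since the number of gene pairs grows quadratically.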

Marker panel predictions by COMET align closely between the heuristic approach and exact computation. Results displayed are a representative example, computed from COMET's analysis of the follicular B‐cell cluster (Fig 5A). (C) Running time can be improved with a proper choice of heuristic core size while maintaining the accuracy of results. The number of missed combinations in the top 2,000 ranked combinations is plotted against the time of computation for the 2‐gene and 3‐gene cases for a variety of heuristic core sizes to determine accuracy versus runtime. The point at which the number of missed combinations levels off is a good place to set the heuristic core size for the best speed‐up and accuracy; COMET's current default is 50. (D) The COMET‐generated rankings of each of the top 2,000 combinations for 2‐gene (left) and 3‐gene (right) panels are plotted against each combination's ranking from COMET's heuristic approach when using different sizes for the gene set heuristic core (Materials and Methods). At a core size of 25 (using the top‐ranking 25 single genes as the heuristic core) or larger, results align very closely between the heuristic and exact approaches.

Figure 3. COMET accurately and efficiently computes marker panels for cell populations

The XL‐mHG test outperforms various differential expression tests in identifying favorable marker genes to be used as markers from simulated datasets (A, Materials and Methods), with respect to both robustness to small effect sizes (B, left) and sensitivity to sample size (B, right). B, left: When varying the magnitude of the difference between the means of the expression distributions for the cluster of interest (K) compared to the background (C) (termed here “effect size”, see illustration in (A)), common DE tests drop below 0.05 significance level at small effect sizes (of approximately 0.4), while the XL‐mHG test reaches significance only at approximately 3.6. Identification of favorable marker genes requires achieving satisfactory sensitivity and specificity rates which would not be achievable in cases of small effect sizes (due to the large overlap across the compared distributions). Hence, the XL‐mHG test performs better than commonly used DE tests in that it does not assign significant P‐values to genes that are differentially expressed but would be poor markers due to small effect sizes. B, right: When varying the total number of cells simulated in clusters K and C (termed here “sample size”, see illustration in (A)) for a fixed and small effect size of 1, common DE tests pick up the small difference in expression as significant once the sample sizes become large (and the detection power increases), while the XL‐mHG test does not reach significance and would not consider such genes as potential markers. The small effect size in this example simulates a poor marker for which desirable sensitivity and specificity rates could not be achieved, and this is controlled in the XL‐mHG test by the X and L parameters.

The XL‐mHG test outperforms logistic regression and tree ensemble classifiers (including random forest and extra trees) in identifying favorable genes to be used as markers from simulated datasets (noisy Poisson–Gamma generative model, see Materials and Methods). The scaled sum of ranks (SSR) metric indicates the ability of a method to rank highly good marker genes, with a value of SSR = 1 indicating optimal ranking.

COMET accurately identifies established markers for cell subpopulations in mouse spleen. Shown are the rankings of established marker genes for immune populations generated by different methods used for single‐gene marker identification. Data are taken from the spleen tissue of the MCA (Han et al, 2018).

Figure 4. COMET identifies favorable markers for splenic B cells

COMET outputs for the splenic B‐cell population from the MCA dataset (Han et al, 2018). (A) COMET output of the top 10 ranked candidate marker genes. (B, C) COMET plots the expression of a gene across all cells (right) and the binarized values of gene expression following binarization (red: expressed; blue: not expressed) by the XL‐mHG threshold (left). Shown are COMET visualization outputs for CD19 (B) and Ly‐6D, CD20, and CD79b (C).

Flow cytometry analysis comparing the protein level staining of CD19, an established marker for B cells (Nadler et al, 1983), with three top‐ranking marker genes in the COMET output confirms that COMET's top‐ranking candidate markers are favorable for flow cytometry staining of B cells. The genes to validate were selected based on the availability of reliable antibodies (SP = single positive). Bars and error bars indicate the mean and standard deviation. ****P < 0.0001; n = 4 biological replicates; unpaired, two‐tailed t‐test.

Figure 5. COMET identifies favorable markers for splenic follicular B cells

Clustering and t‐SNE visualization of splenic B cells as generated by Tabula Muris (Tabula Muris Consortium, 2018).

Expression of follicular B‐cell marker CD23 and marginal zone B‐cell marker CD21 in the splenic B‐cell dataset from Tabula Muris as visualized by COMET.

Expression of the three top‐ranking markers for follicular B cells output by COMET, CD55, CD62L, and CXCR4, as visualized by COMET.

Flow cytometry analysis comparing the protein level staining of CD23, an established marker for follicular B cells, with the three top‐ranking marker genes in the COMET output confirms that COMET's top‐ranking candidate markers are favorable for flow cytometry staining of the follicular B‐cell subtype. The genes to validate were selected based on the availability of reliable antibodies (SP = single positive). Bars and error bars indicate the mean and standard deviation. ****P < 0.0001; n = 6 biological replicates; unpaired, two‐tailed t‐test.

Figure 6. COMET identifies favorable multi‐gene marker panels for splenic follicular B cells

COMET outputs for two highly ranked 2‐gene marker panels predicted by COMET to isolate the splenic follicular B‐cell population, based on analysis of the Tabula Muris dataset (Tabula Muris Consortium, 2018). Shown are binarized values of gene expression following binarization by the XL‐mHG threshold for each gene separately (left, middle) and when using both genes combined (right).

Flow cytometry staining for the marker combinations CD62L+ CD44− (C) and CD55+ CD62L+ (D) confirms that COMET's candidate multi‐gene marker panels are favorable for flow cytometry staining of splenic follicular B cells. Both marker combinations included a significantly higher frequency of follicular B cells (CD23+) and a lower frequency of other B‐cell subpopulations (DN and CD21+) than the single staining for CD62L+ and CD55+, respectively (DN = double negative). The marker combinations were selected based on availability of established antibodies. Bars and error bars indicate the mean and standard deviation. *P < 0.05; **P < 0.01; ***P < 0.001; n = 6 biological replicates; unpaired, two‐tailed t‐test.


Using mitochondrial DNA ND1 markers, the genetic structure, origin, and invasion history of B. dorsalis s.s. were investigated. It was observed that distinct lineages (both minor and major) originated from specific southeast Asian populations. Interestingly, minor lineages have not spread in China. Evidence was found indicating symmetrical migration from southeast Asia to China. Understanding origin and genetic structure of B. dorsalis s.s. will possibly assist in the development of effective management strategies to prevent biological invasion. Source-tracking and minor distinct lineage “encounter” approaches may also provide better clues to the design of appropriate control methods, such as introducing natural enemies, to minimize biological invasion of B. dorsalis s.s. in China.

1. Introduction

Cystic echinococcosis (hydatid disease) is an important and globally distributed parasitic zoonosis caused by the larval stage of the cestode parasite Echinococcus granulosus complex [1]. Intermediate hosts, which include humans, sheep, goats, cattle, yak, camels, and other wild mammals, become infected by ingesting the parasite's eggs from infected carnivores (the definitive hosts). Subsequently, a larval stage (metacestode) develops as a cyst in the internal organs (mainly in liver and lungs) of the intermediate host.

The causative agent of cystic echinococcosis was traditionally regarded to be a single species, E. granulosus. However, recent research has shown that E. granulosus is a species complex consisting of several taxa that differ in adult morphology, their preferences for intermediate hosts, and their pathogenicity to animals and humans [2]. This species complex is now differentiated into ten genotypes (G1–G10) [3–8]. Moreover, some researchers have suggested that E. granulosus should be classified as four species based on the substantial molecular differences in both mitochondrial and nuclear DNA genes: E. granulosus sensu stricto (genotypes G1–G3), E. equinus (genotype G4), E. ortleppi (genotype G5), and E. canadensis (genotypes G6–G10) [9], though the status of E. canadensis is still disputed [9�]. Meanwhile, a new independent taxon named E. felidis (lion strain) was isolated from South Africa [13].

In China, cystic echinococcosis has been reported in more than twenty provinces and is particularly prevalent [14, 15]. However, to date, infections have been ascribed to just two E. granulosus genotypes: G1 (a sheep strain) and G6 (a camel strain) [16]. Southwest China is one of the areas of China most seriously affected by E. granulosus infections. Past geologic events and climate fluctuations have led to a high biodiversity of species in this area [17, 18]. In addition, E. shiquicus, a new species of Echinococcus, has recently been discovered in this region [19]. The first human CE case in Asia infected with the G5 genotype (cattle strain) has also recently been reported [20]. For these reasons, it is critical to understand the genetic composition and structure of the E. granulosus complex in this region. In this study, we provide the first investigation of the molecular diagnostics of cystic echinococcosis infections in Southwest China.

Mitochondrial DNA has been widely used in population genetics to elucidate phylogenies, as it experiences high mutation and low recombination rates and thus best reflects population genetic structure, population differentiation, and species relationships [21]. The NADH dehydrogenase subunit 2 gene (ND2 gene) evolves faster than other mitochondrial genes and is widely applied in molecular systematics and population genetics studies [22–25]. We used the ND2 gene as a genetic marker to investigate the genetic diversity and structure of Echinococcus granulosus within Southwest China. This information will be essential for further studies investigating the biology and transmission dynamics of these parasites, especially to humans, and will underpin research on the diagnosis, control, and prevention of this disease [2, 26�].
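As a rough illustration of what "genetic diversity" means for a single mitochondrial marker, the sketch below computes two standard summary statistics from aligned sequences: haplotype diversity, h = n/(n-1) * (1 - sum of squared haplotype frequencies), and nucleotide diversity, the mean number of pairwise differences per site. The function names and the toy alignment are hypothetical and are not taken from the study; real analyses would normally rely on dedicated population genetics software.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(seqs):
    """Haplotype diversity: h = n/(n-1) * (1 - sum(p_i^2))."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(seqs)  # identical sequences count as one haplotype
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site for aligned, equal-length sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * length)


# Hypothetical toy alignment (NOT real ND2 data): 4 isolates, 10 sites.
toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAT"]
h = haplotype_diversity(toy)
pi = nucleotide_diversity(toy)
print(f"haplotype diversity h = {h:.4f}")
print(f"nucleotide diversity pi = {pi:.4f}")
```

With so few sequences the estimates are very noisy; the point is only to show how the two quantities are defined.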


Author information


Institute of Molecular and Cell Biology, Agency for Science, Technology and Research, Singapore, 138673, Singapore

Patricia J. Ahl, Richard A. Hopkins, Wen Wei Xiang, Bijin Au, Nivashini Kaliaperumal, Anna-Marie Fairhurst & John E. Connolly

Department of Microbiology and Immunology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117545, Singapore

Patricia J. Ahl & John E. Connolly

Tessa Therapeutics Pte Ltd, Institute of Molecular and Cell Biology, Agency for Science, Technology and Research, Singapore, 138673, Singapore

Ancient Horse DNA Reveals Gene Flow Between North American and Eurasian Horses

A new study of ancient DNA from horse fossils found in North America and Eurasia shows that horse populations on the two continents remained connected through the Bering Land Bridge, moving back and forth and interbreeding multiple times over hundreds of thousands of years.

The new findings demonstrate the genetic continuity between the horses that died out in North America at the end of the last ice age and the horses that were eventually domesticated in Eurasia and later reintroduced to North America by Europeans. The study has been accepted for publication in the journal Molecular Ecology and is currently available online.

Paleontologist Aisling Farrell holds a mummified frozen horse limb recovered from a placer gold mine in the Klondike goldfields in Yukon Territory, Canada. Ancient DNA recovered from horse fossils reveals gene flow between horse populations in North America and Eurasia. Credit: Government of Yukon

“The results of this paper show that DNA flowed readily between Asia and North America during the ice ages, maintaining physical and evolutionary connectivity between horse populations across the Northern Hemisphere,” said corresponding author Beth Shapiro, professor of ecology and evolutionary biology at UC Santa Cruz and a Howard Hughes Medical Institute investigator.

The study highlights the importance of the Bering Land Bridge as an ecological corridor for the movement of large animals between the continents during the Pleistocene, when massive ice sheets formed during glacial periods. Dramatically lower sea levels uncovered a vast land area known as Beringia, extending from the Lena River in Russia to the Mackenzie River in Canada, with extensive grasslands supporting populations of horses, mammoths, bison, and other Pleistocene fauna.

Paleontologists have long known that horses evolved and diversified in North America. One lineage of horses, known as the caballine horses (which includes domestic horses), dispersed into Eurasia over the Bering Land Bridge about 1 million years ago, and the Eurasian population then began to diverge genetically from the horses that remained in North America.

The new study shows that after the split, there were at least two periods when horses moved back and forth between the continents and interbred, so that the genomes of North American horses acquired segments of Eurasian DNA and vice versa.

“This is the first comprehensive look at the genetics of ancient horse populations across both continents,” said first author Alisa Vershinina, a postdoctoral scholar working in Shapiro’s Paleogenomics Laboratory at UC Santa Cruz. “With data from mitochondrial and nuclear genomes, we were able to see that horses were not only dispersing between the continents, but they were also interbreeding and exchanging genes.”

Mitochondrial DNA, inherited only from the mother, is useful for studying evolutionary relationships because it accumulates mutations at a steady rate. It is also easier to recover from fossils because it is a small genome and there are many copies in every cell. The nuclear genome carried by the chromosomes, however, is a much richer source of evolutionary information.

Alisa Vershinina works in the Paleogenomics Lab at UC Santa Cruz where ancient DNA is extracted from fossils for sequencing and analysis. Credit: UC Santa Cruz

The researchers sequenced 78 new mitochondrial genomes from ancient horses found across Eurasia and North America. Combining those with 112 previously published mitochondrial genomes, the researchers reconstructed a phylogenetic tree, a branching diagram showing how all the samples were related. With a location and an approximate date for each genome, they could track the movements of different lineages of ancient horses.

“We found Eurasian horse lineages here in North America and vice versa, suggesting cross-continental population movements. With dated mitochondrial genomes we can see when that shift in location happened,” Vershinina explained.

The analysis showed two periods of dispersal between the continents, both coinciding with periods when the Bering Land Bridge would have been open. In the Middle Pleistocene, shortly after the two lineages diverged, the movement was mostly east to west. A second period in the Late Pleistocene saw movement in both directions, but mostly west to east. Due to limited sampling in some periods, the data may fail to capture other dispersal events, the researchers said.

The team also sequenced two new nuclear genomes from well-preserved horse fossils recovered in Yukon Territory, Canada. These were combined with seven previously published nuclear genomes, enabling the researchers to quantify the amount of gene flow between the Eurasian and North American populations.

“The usual view in the past was that horses differentiated into separate species as soon as they were in Asia, but these results show there was continuity between the populations,” said coauthor Ross MacPhee, a paleontologist at the American Museum of Natural History. “They were able to interbreed freely, and we see the results of that in the genomes of fossils from either side of the divide.”

The new findings are sure to fuel the ongoing controversy over the management of wild horses in the United States, descendants of domestic horses brought over by Europeans. Many people regard those wild horses as an invasive species, while others consider them to be part of the native fauna of North America.

“Horses persisted in North America for a long time, and they occupied an ecological niche here,” Vershinina said. “They died out about 11,000 years ago, but that’s not much time in evolutionary terms. Present-day wild North American horses could be considered reintroduced, rather than invasive.”

Coauthor Grant Zazula, a paleontologist with the Government of Yukon, said the new findings help reframe the question of why horses disappeared from North America. “It was a regional population loss rather than an extinction,” he said. “We still don’t know why, but it tells us that conditions in North America were dramatically different at the end of the last ice age. If horses hadn’t crossed over to Asia, we would have lost them all globally.”

Reference: “Ancient horse genomes reveal the timing and extent of dispersals across the Bering Land Bridge” by Alisa O. Vershinina, Peter D. Heintzman, Duane G. Froese, Grant Zazula, Molly Cassatt-Johnstone, Love Dalén, Clio Der Sarkissian, Shelby G. Dunn, Luca Ermini, Cristina Gamba, Pamela Groves, Joshua D. Kapp, Daniel H. Mann, Andaine Seguin-Orlando, John Southon, Mathias Stiller, Matthew J. Wooller, Gennady Baryshnikov, Dmitry Gimranov, Eric Scott, Elizabeth Hall, Susan Hewitson, Irina Kirillova, Pavel Kosintsev, Fedor Shidlovsky, Hao-Wen Tong, Mikhail P. Tiunov, Sergey Vartanyan, Ludovic Orlando, Russell Corbett-Detig, Ross D. MacPhee and Beth Shapiro, 10 May 2021, Molecular Ecology.
DOI: 10.1111/mec.15977