Collective term for both exons and introns

Collective term for both exons and introns

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

Is there a term I can use to refer collectively to both exons and introns? By collectively, I don't mean ligated as with an unprocessed transcription product. I'm just writing about exons and introns and getting frustrated that I keep on having to write "exons and introns", wishing there was a more concise term I can use in the place of that phrase.

For example, the term I am looking for would fill in the following blank perfectly.

The gene has 5 exons and 4 introns, so it has 9 _____.

How about "splicing fragments"? It might be easier to refer to them according to the mechanism of production.

I think your best bet would be transcript. Gene is actually quite close but most of the current definitions include the promoter sequence in the gene so that is not restricted to introns and exons.

Transcripts, on the other hand, are all of a gene's sequence that is transcribed and consist exclusively of introns and exons (UTRs are also exons and are actually spliced in some cases). Note that I am using transcript and not mRNA since mature mRNAs do not of course contain introns. Immature ones do and transcript can apply.

RNA Processing

In the appropriate cell type and at the correct developmental stage, ribonucleic acid (RNA) polymerase transcribes an RNA copy of a gene, the primary transcript. However, the primary transcript may contain many more nucleotides than are needed to create the intended protein. In addition, the primary transcript is vulnerable to breakdown by RNA-degrading enzymes.

Before the primary transcript can be used to guide protein synthesis, it must be processed into a mature transcript, called messenger RNA (mRNA). This is especially true in eukaryotic cells . Processing events include protection of both ends of the transcript and removal of intervening nonprotein-coding regions.

β -thalassemia, a hemoglobin disease, can be caused by an intron mutation that prevents recognition of a splice site.

On an RNA molecule, the end formed earliest is known as the 5 ′ (5-prime) end, whereas the trailing end is the 3 ′ end. The ends of the primary transcript are particularly susceptible to a class of degradative enzymes called exonucleases. During processing, the 5 ′ end of the primary transcript is protected against the effects of these enzymes by the addition of a CAP. The CAP uses an unusual linkage between nucleotides. Exonucleases do not recognize this unusual structure and therefore cannot remove the CAP. Since exonucleases work only from an end, if the CAP nucleotide cannot be removed, the entire 5 ′ end of the mRNA is protected. The 5 ′ CAP also aids in transport out of the nucleus and helps bind the mRNA to the ribosome .

To protect the 3 ′ end against degradative exonucleases, a poly-A tail is added by poly-A polymerase. Poly-A is a chain of adenine nucleotides, one hundred to two hundred units long. The poly-A tail has typical bonds that are susceptible to degradation by exonucleases, but it does not have any protein coding function so it does not particularly matter if some of the A residues are degraded. It actually takes quite some time for the poly-A tail to be completely lost, and during this time the protein coding portion of the mRNA remains intact. Without the poly-A tail, however, the exonucleases would rapidly degrade into the protein coding portion of the mRNA. An exception to the poly-A strategy is seen in the mRNA for histones, proteins that wrap deoxyribonucleic acid (DNA) into chromosomes. Instead of poly-A, histone mRNA uses a much smaller structure that is regulated by factors present during DNA synthesis.

The most striking event in RNA processing occurs because the protein coding region in eukaryotic genes is not continuous. A typical eukaryotic gene is composed of a number of protein coding regions, called exons, that are separated by noncoding regions called introns. In fact, the number of nucleotides in the introns can be much larger than the number of nucleotides in the combined exons. The DNA gene contains the code for both the exons and the introns, as does the primary RNA transcript, but the noncoding intron sequences must be removed to form the mRNA before protein synthesis.

The process by which introns are removed and exons are joined to one another is called RNA splicing, and it is catalyzed by complexes of proteins and RNA called SNuRPs (small nuclear ribonucleoprotein particles). These complexes locate special RNA sequences that flank the exon/intron junctions, bind to them, and catalyze the splicing reactions. Some primary transcripts can be spliced in a few different ways. Such "alternate splicing" yields a range of related proteins.

After addition of the CAP to the 5 ′ end, the poly-A tail to the 3 ′ end, and splicing of the introns, the processing is complete and the mRNA is transported through nuclear pores to the cytoplasm of the eukaryotic cell where translation (protein synthesis) will occur.

see also Gene Nuclear Transport Protein Synthesis RNA Transfer RNA Transcription


Hello. Not a domain expert here, but should the first sentence of this article say not "sequence encoded by a gene" but "sequence encoding a gene"?

I think that the ORF is the sequence that is between a start and a stop codon.

The current Wikipedia entrance on Exons claims that exons are INCORRECTLY referred to stretches of DNA that do correspond to coding DNA. The reasoning behind this is a reference to a paper that discusses this types of exons. One paper does not merit to change a definition of something that is found in all textbooks and all glossary as: Coding sequence of DNA present in mature messenger RNA. EXpressing regiON. This whole entrance is INCORRECT and should be changed.

Regarding the remark above this one: if you refer to a mRNA you are correct, if you refer to hnRNA or the gene you are wrong since these might contain INTeRvening regiONs

Glossary definition of exon, from the Lodish textbook: Segment of a eukaryotic gene (or of its primary transcript) that reaches the cytoplasm as part of a mature mRNA, rRNA, or tRNA molecule. As far as I know, everybody in genomics recognizes that exons need not contain protein-coding sequence 3' and 5' UTR exons are found in most eukaryotes, and there are even entirely non-protein-coding transcripts that undergo splicing. Exons that do contain protein-coding sequence are referred to as CDS exons when the distinction is important. --Mike Lin 21:20, 11 July 2006 (UTC)

The Wikipedia entry is correct not all exons code for proteins. Although a search of online definitions [1] finds many entries that imply exons must code for proteins. Most will, although a significant amount of exons code for UTRs see the entry on 3'UTRs.

TransControl (talk) —Preceding comment was added at 03:21, 22 February 2008 (UTC)

I've been seeing a lot of "exons do not have to code for protein" in the literature: but no-where have I seen it explained clearly. Is it simply that portions of the mature mRNA (poly-A sites and UTR's, regulatory sequences(?) that sort of thing) which are not represented as protein but are, again, present in the mature mRNA, are exons? Geno-Supremo (talk) 16:23, 2 February 2009 (UTC)

Got this from the Complementary DNA entry:

The central dogma of molecular biology outlines that in synthesizing proteins, DNA is transcribed into mRNA, which is translated into protein. One difference between eukaryotic and prokaryotic mRNA is that eukaryotic mRNA can contain introns (intervening sequences), which are not coding sequences, per se, and must be spliced out of the mRNA before it is translated into protein. Prokaryotic mRNA has no introns, so it is not subject to splicing.

And this from the Exon Entry:

An exon is any region of DNA within a gene that is transcribed to the final messenger RNA (mRNA) molecule, rather than being spliced out from the transcribed RNA molecule. Exons of many eukaryotic genes interleave with segments of non-coding DNA (introns). The term exon was coined by American biochemist Walter Gilbert in 1978:

So we have "eukaryotic mRNA can contain introns" and "An exon is any region of DNA within a gene that is transcribed to the final messenger RNA (mRNA)". Those can't both be true.

I'm guessing that the statement under Exon is supposed to be "eukaryotic RNA can contain introns". Not mRNA. But I don't know from biology. :) —The preceding unsigned comment was added by (talk) 20:13, 6 March 2007 (UTC).

The current opening is confusing:

"An exon is any region of DNA within a gene that is transcribed to the final messenger RNA (mRNA) molecule, rather than being spliced out from the transcribed RNA molecule. Exons of many eukaryotic genes interleave with segments of non-coding DNA (introns)"

An exon is the region of a transcribed gene present in the final functional RNA molecule, whereas intragenic introns are spliced out.

Note: a) part of every mature mRNA is non-coding UTR but exonic. As in the diagram and text. b) other RNAs, tRNAs etc have introns c) However, introns, by definition do not code for proteins. (At least in the mRNAs they are removed from.) d) There is some lack of clarity as to whether the exon is part of the gene (DNA) or mRNA, or both.

This prominent Molecular Biology online text book has a definition similar to that proposed, but it is more complex:

TransControl (talk) —Preceding comment was added at 04:23, 20 February 2008 (UTC)

Exon - one of the officers rank in the Yeomen of the Guard.

The first mention of Exon is in the ceremony of All Nights, which is fully described in the chapter relating to Charles II. They were added to the staff of officers in 1668 just about the time when Marsham’s account of All Night was written.

The derivation and meaning of the word Exon has been and is a puzzle to many, but it is undoubtedly the French pronunciation of the word exempt. An exempt was an officer in the old French Garde Du Corps. “Exempts des Guedes du Corps” are described in a military dictionary as “Exons belonging to the Body Guards,” There was in France an officer of police called “Un Exempt (exon) de Police.”

When Charles II formed his Horse Guards he created a commissioned officer who was styled indiscriminately the exempt or the Exon, and in each of the two troops this officer ranked with the Captain. There is further confusion connected with the title of Exon, for in his commission he is styled corporal. But it appears that in Elizabeth’s reign “corporal” was a commissioned officer, and the term was synonymous with Captain.

Down to the time of the Coronation of George III, which took place on 22 September 1761, corporal was only another word for Exon, as may be seen on referring to the official programme of the Coronation, wherein mention is made of “the Corporals or Exons of the Yeomen of the Guard.” The exempt in the French Garde du corps always had charge of the Night Watch, and the Exon is the English Body Guard was especially appointed for that service.

Curiously enough the word Exempt is also used in the orders of the Yeomen of the Guard with its English meaning.

Edited from A History of the Yeomen of the Guard 1485 - 1885 —Preceding unsigned comment added by (talk) 22:50, 11 December 2008 (UTC)

While the quote does say the term derives from expressed region if you read it carefully, I specified it directly since I think it may not be obvious if you don't read carefully (I didn't actually notice it at first). If there's any disagreement that exon derives from expressed region, [2] supports the claim Nil Einne (talk) 12:52, 30 September 2010 (UTC)

I'm surprised that there isn't a link to operon here? I don't know about the two to know if it should definitely be mentioned in this article or not (my very-lay-understanding (*) is that an operon is a general term for the collection of several introns, exons, promotor sequence(s), etc involved in a single gene regulation subsystem. Rather like the equivalent of a single file from a 10-million-file-30-billion-line computer program? (to take a perhaps intelligent design POV - not that I know any more about the "modern" (non-religious, ? sic) ID movement)

(*) I went to high school starting in 1994, when only the best schools mentioned even alternative splicing or introns. (I remember feeling quite proud of myself about making a set of lecture notes for my bio teacher that went into "detail" (sic).. How naive - I should make her a new set of notes that really goes into detail - introns, t-, m-, r- and other kinds of rna, psedugenes and retrotransposons, self-translating (. ) rRna, etc)

Most eukaryotes do not have operons, so they are commonly described in a prokaryotes-context. However, splicing is only found in eukaryotes, hence exons and introns are not found in prokaryotes. So I think it should not necessarily be mentioned here. But the exon article definitely needs expansion. Ilikelifesciences (talk) 09:07, 2 September 2015 (UTC)

The comment(s) below were originally left at Talk:Exon/Comments , and are posted here for posterity. Following several discussions in past years, these subpages are now deprecated. The comments may be irrelevant or outdated if so, please feel free to remove this section.

Rated "high" as high school/SAT biology content, part of gene structure. - tameeria 23:48, 18 February 2007 (UTC)

Last edited at 23:48, 18 February 2007 (UTC). Substituted at 14:51, 29 April 2016 (UTC)

Intron definition, exon definition and back-splicing revisited

Pre-mRNA splicing is performed by the sequential function of different spliceosome complexes. Spliceosome assembly depends on the presence — from the 5′ end to the 3′ end of introns — of conserved 5′ splice site (5′SS), branch point sequence (BPS) and 3′SS. Of all spliceosome complexes, the only one lacking a resolved structure is the first to assemble on the pre-mRNA, known as the E complex. It remains unclear how the spliceosome initially defines (recognizes and assembles across) introns or exons, and how canonical splicing is favoured over a non-canonical reaction known as back-splicing, which generates exonic circular RNAs (circRNAs). Li et al. now present the cryo-electron microscopy (cryo-EM) structure of the E complex of Saccharomyces cerevisiae, which suggests that intron definition, exon definition and back-splicing can be carried out by the same complexes.

In yeast, which typically contain short introns and long exons, intron definition seems to dominate. By contrast, in vertebrates, where introns are longer and exons are shorter, exon definition is thought to prevail. To study these mechanisms in detail, the authors assembled in vitro functional budding yeast E complexes on either the ACT1 pre-mRNA or the UBC4 pre-mRNA, and determined their cryo-EM structures.

Key to initiating splicing through intron definition is bringing together the 5′SS and the BPS. A striking feature found in the structure of the ACT1–E complex was a

25 bp double helix downstream of the 5′SS, which was not found in the UBC4–E complex. The 5′SS-to-BPS region (265 nt) of the ACT1 intron, but not the same region in UBC4 (58 nt), is predicted to form long stem-like structures, and mutations that abolish these structures resulted in substantial pre-mRNA accumulation (splicing inhibition). Thus, secondary structures may help to bring together essential intronic elements and facilitate spliceosome assembly.

The structure of the E complex, especially the relative positions of the 5′SS and the BPS, suggested that the same E complex can form across exons: instead of connecting a 5′SS with a downstream BPS across an intron, the 5′SS could connect with an upstream BPS across an exon. To test whether exon definition occurs in vivo in yeast, the authors truncated the DYN2 gene to contain only its middle exon and flanking introns (IEI construct) and mutated the splicing elements bordering either side of the DYN2 exon — a BPS mutation in intron 1 and a 5′SS mutation in intron 2. If splicing of DYN2 is governed solely by intron definition, retention of the intron in which a mutation resides would be expected, with minimal effect on the other intron however, if splicing is governed by exon definition, each of the mutations would lead to the retention of both introns. The observed composition of splicing intermediates produced from the different IEI mutants suggested that both intron definition and exon definition occur in vivo and contribute to correct splicing of DYN2 and showed, for the first time, that exon definition occurs in yeast.

A prediction of this model is that formation of exon definition complexes across long exons leads to back-splicing — splicing of the 5′SS downstream with the 3′SS upstream of the exon — and the formation of exonic circRNAs, owing to the lack of steric hindrance. Shortening exons of circRNA-producing IEI constructs abolished circRNA formation from the constructs, supporting the preference for back-splicing across long (or multiple) exons and the occurrence of exon definition.

“both intron definition and exon definition occur in vivo and contribute to correct splicing”

In summary, in yeast (and likely in all eukaryotes) the same E complexes are able to define both introns and exons, without the need for additional components or structural rearrangements. Furthermore, exon definition can cause back-splicing across long exons, suggesting that circRNAs are natural by-products of spliceosome-mediated splicing in all eukaryotes.

Biology 171

By the end of this section, you will be able to do the following:

  • Describe the different steps in RNA processing
  • Understand the significance of exons, introns, and splicing for mRNAs
  • Explain how tRNAs and rRNAs are processed

After transcription, eukaryotic pre-mRNAs must undergo several processing steps before they can be translated. Eukaryotic (and prokaryotic) tRNAs and rRNAs also undergo processing before they can function as components in the protein-synthesis machinery.

MRNA Processing

The eukaryotic pre-mRNA undergoes extensive processing before it is ready to be translated. Eukaryotic protein-coding sequences are not continuous, as they are in prokaryotes. The coding sequences (exons) are interrupted by noncoding introns, which must be removed to make a translatable mRNA. The additional steps involved in eukaryotic mRNA maturation also create a molecule with a much longer half-life than a prokaryotic mRNA. Eukaryotic mRNAs last for several hours, whereas the typical E. coli mRNA lasts no more than five seconds.

Pre-mRNAs are first coated in RNA-stabilizing proteins these protect the pre-mRNA from degradation while it is processed and exported out of the nucleus. The three most important steps of pre-mRNA processing are the addition of stabilizing and signaling factors at the 5′ and 3′ ends of the molecule, and the removal of the introns ((Figure)). In rare cases, the mRNA transcript can be “edited” after it is transcribed.

The trypanosomes are a group of protozoa that include the pathogen Trypanosoma brucei, which causes nagana in cattle and sleeping sickness in humans throughout great areas of Africa ((Figure)). The trypanosome is carried by biting flies in the genus Glossina (commonly called tsetse flies). Trypanosomes, and virtually all other eukaryotes, have organelles called mitochondria that supply the cell with chemical energy. Mitochondria are organelles that express their own DNA and are believed to be the remnants of a symbiotic relationship between a eukaryote and an engulfed prokaryote. The mitochondrial DNA of trypanosomes exhibit an interesting exception to the central dogma: their pre-mRNAs do not have the correct information to specify a functional protein. Usually, this is because the mRNA is missing several U nucleotides. The cell performs an additional RNA processing step called RNA editing to remedy this.

Other genes in the mitochondrial genome encode 40- to 80-nucleotide guide RNAs. One or more of these molecules interacts by complementary base pairing with some of the nucleotides in the pre-mRNA transcript. However, the guide RNA has more A nucleotides than the pre-mRNA has U nucleotides with which to bind. In these regions, the guide RNA loops out. The 3′ ends of guide RNAs have a long poly-U tail, and these U bases are inserted in regions of the pre-mRNA transcript at which the guide RNAs are looped. This process is entirely mediated by RNA molecules. That is, guide RNAs—rather than proteins—serve as the catalysts in RNA editing.

RNA editing is not just a phenomenon of trypanosomes. In the mitochondria of some plants, almost all pre-mRNAs are edited. RNA editing has also been identified in mammals such as rats, rabbits, and even humans. What could be the evolutionary reason for this additional step in pre-mRNA processing? One possibility is that the mitochondria, being remnants of ancient prokaryotes, have an equally ancient RNA-based method for regulating gene expression. In support of this hypothesis, edits made to pre-mRNAs differ depending on cellular conditions. Although speculative, the process of RNA editing may be a holdover from a primordial time when RNA molecules, instead of proteins, were responsible for catalyzing reactions.

5′ Capping

While the pre-mRNA is still being synthesized, a 7-methylguanosine cap is added to the 5′ end of the growing transcript by a phosphate linkage. This functional group protects the nascent mRNA from degradation. In addition, factors involved in protein synthesis recognize the cap to help initiate translation by ribosomes.

3′ Poly-A Tail

Once elongation is complete, the pre-mRNA is cleaved by an endonuclease between an AAUAAA consensus sequence and a GU-rich sequence, leaving the AAUAAA sequence on the pre-mRNA. An enzyme called poly-A polymerase then adds a string of approximately 200 A residues, called the poly-A tail . This modification further protects the pre-mRNA from degradation and is also the binding site for a protein necessary for exporting the processed mRNA to the cytoplasm.

Pre-mRNA Splicing

Eukaryotic genes are composed of exons , which correspond to protein-coding sequences (ex-on signifies that they are expressed), and intervening sequences called introns (int-ron denotes their intervening role), which may be involved in gene regulation but are removed from the pre-mRNA during processing. Intron sequences in mRNA do not encode functional proteins.

The discovery of introns came as a surprise to researchers in the 1970s who expected that pre-mRNAs would specify protein sequences without further processing, as they had observed in prokaryotes. The genes of higher eukaryotes very often contain one or more introns. These regions may correspond to regulatory sequences however, the biological significance of having many introns or having very long introns in a gene is unclear. It is possible that introns slow down gene expression because it takes longer to transcribe pre-mRNAs with lots of introns. Alternatively, introns may be nonfunctional sequence remnants left over from the fusion of ancient genes throughout the course of evolution. This is supported by the fact that separate exons often encode separate protein subunits or domains. For the most part, the sequences of introns can be mutated without ultimately affecting the protein product.

All of a pre-mRNA’s introns must be completely and precisely removed before protein synthesis. If the process errs by even a single nucleotide, the reading frame of the rejoined exons would shift, and the resulting protein would be dysfunctional. The process of removing introns and reconnecting exons is called splicing ((Figure)). Introns are removed and degraded while the pre-mRNA is still in the nucleus. Splicing occurs by a sequence-specific mechanism that ensures introns will be removed and exons rejoined with the accuracy and precision of a single nucleotide. Although the intron itself is noncoding, the beginning and end of each intron is marked with specific nucleotides: GU at the 5′ end and AG at the 3′ end of the intron. The splicing of pre-mRNAs is conducted by complexes of proteins and RNA molecules called spliceosomes.

Errors in splicing are implicated in cancers and other human diseases. What kinds of mutations might lead to splicing errors? Think of different possible outcomes if splicing errors occur.

Note that more than 70 individual introns can be present, and each has to undergo the process of splicing—in addition to 5′ capping and the addition of a poly-A tail—just to generate a single, translatable mRNA molecule.

View RNA Splicing (video) to see how introns are removed during RNA splicing.

Processing of tRNAs and rRNAs

The tRNAs and rRNAs are structural molecules that have roles in protein synthesis however, these RNAs are not themselves translated. Pre-rRNAs are transcribed, processed, and assembled into ribosomes in the nucleolus. Pre-tRNAs are transcribed and processed in the nucleus and then released into the cytoplasm where they are linked to free amino acids for protein synthesis.

Most of the tRNAs and rRNAs in eukaryotes and prokaryotes are first transcribed as a long precursor molecule that spans multiple rRNAs or tRNAs. Enzymes then cleave the precursors into subunits corresponding to each structural RNA. Some of the bases of pre-rRNAs are methylated that is, a –CH3 methyl functional group is added for stability. Pre-tRNA molecules also undergo methylation. As with pre-mRNAs, subunit excision occurs in eukaryotic pre-RNAs destined to become tRNAs or rRNAs.

Mature rRNAs make up approximately 50 percent of each ribosome. Some of a ribosome’s RNA molecules are purely structural, whereas others have catalytic or binding activities. Mature tRNAs take on a three-dimensional structure through local regions of base pairing stabilized by intramolecular hydrogen bonding. The tRNA folds to position the amino acid binding site at one end and the anticodon at the other end ((Figure)). The anticodon is a three-nucleotide sequence in a tRNA that interacts with an mRNA codon through complementary base pairing.

Section Summary

Eukaryotic pre-mRNAs are modified with a 5′ methylguanosine cap and a poly-A tail. These structures protect the mature mRNA from degradation and help export it from the nucleus. Pre-mRNAs also undergo splicing, in which introns are removed and exons are reconnected with single-nucleotide accuracy. Only finished mRNAs that have undergone 5′ capping, 3′ polyadenylation, and intron splicing are exported from the nucleus to the cytoplasm. Pre-rRNAs and pre-tRNAs may be processed by intramolecular cleavage, splicing, methylation, and chemical conversion of nucleotides. Rarely, RNA editing is also performed to insert missing bases after an mRNA has been synthesized.

Art Connections

(Figure) Errors in splicing are implicated in cancers and other human diseases. What kinds of mutations might lead to splicing errors? Think of different possible outcomes if splicing errors occur.

(Figure) Mutations in the spliceosome recognition sequence at each end of the intron, or in the proteins and RNAs that make up the spliceosome, may impair splicing. Mutations may also add new spliceosome recognition sites. Splicing errors could lead to introns being retained in spliced RNA, exons being excised, or changes in the location of the splice site.

Free Response

Chronic lymphocytic leukemia patients often harbor nonsense mutations in their spliceosome machinery. Describe how this mutation of the spliceosome would change the final location and sequence of a pre-mRNA.

Nonsense spliceosome mutations would eliminate the splicing step of mRNA processing, so the mature mRNAs would retain their introns and be perfectly complementary to the entire DNA template sequence. However, the mRNAs would still undergo addition of the 5’ cap and poly-A tail, and therefore each has the potential to be exported to the cytoplasm for translation.


Organellar and Metabolic Processes

Michel Goldschmidt-Clermont , in The Chlamydomonas Sourcebook , 2009

I. Introduction

The introns in the Chlamydomonas chloroplast genome belong to group I and group II. Introns of these two groups are also found in other organellar genomes, and in particular in plastids and mitochondria from higher plants. Many group I introns, and some group II introns from bacteria and from fungal mitochondria, are ribozymes that have the remarkable ability to catalyze their own excision from precursor transcripts in vitro. Some introns in both groups are mobile genetic elements that can insert into intron-less alleles or into novel sites. However the introns in the two groups differ in their architecture, their mechanisms of splicing and their modes of transposition.

Some introns in group I encode a polypeptide which can act as a maturase to facilitate intron splicing and as an endonuclease to promote intron movement. Likewise some members of group II encode a protein which promotes splicing, and also has endonuclease and reverse transcriptase domains required for intron insertion. However many introns in both groups do not contain such an open reading frame (ORF), or encode a protein that has retained only some of these functions. The structures and functions of group I and group II introns have been described in comprehensive reviews ( Bonen and Vogel, 2001 Herrin and Nickelsen, 2004 Lambowitz and Zimmerly, 2004 Haugen et al., 2005 Robart and Zimmerly, 2005 Pyle and Lambowitz, 2006 Houghland et al., 2006 ). This chapter will focus on aspects of group I and group II splicing that are particular to C. reinhardtii and other Chlamydomonas spp. ( Herrin et al., 1998 ).

DNA & mRNA: Introns and Exons

The finding of the Introns and the exons was one of the most significant discoveries in genetics in the past fifteen years. split genes were discovered when a lack of relation between DNA sequences were seen during. DNA- mRNA hybridization. For all new mRNA, they must be transcribed by RNA polymerase enzymes.

The transcription begins at the promoter sequence on the DNA and works down, thus the nucleotide sequence of the mRNA is complementary to the one of DNA. In eukaryotes, the mRNA is processed in the nucleus before transport to the cytoplasm for translation. In order for the mRNA to become a true functioning RNA, it must undergo several stages of modification.

At first, when the mRNA is produced, a cap is added enzymatically to the 5¹ end of the RNA by linking a 7-methylguanosine residue by a triphosphate bond this is called the G-cap. The G-cap is necessary for translation.

The subunit of the ribosome recognizes the G-cap and then finds the initiation codon to start the translation. As the mRNA comes finishes transcription, the Poly A tail is added to the 3¹ end. As the two ends are placed the mRNA becomes pre-mRNA.

The pre-mRNA consists of splicing and non-coding regions. pre-mRNA molecules are much longer than the mRNA molecule needed to code for its protein. The regions that do not code for amino acids aa, are scattered all along the coding region. The genes are split with coding regions, called exons, short for expressed regions in between the exons the non-coding region called introns exist. Before the translation of mRNA the introns must be spliced off.

Splicing is a complicated process for the cell. It must locate every intron in the primary transcript. An average mRNA consists of eight to ten introns, some even contain sixteen introns. exons, like introns, are also spread apart. Some of their codons may be split by introns, so information for a single amino acid could be some distance apart. Splicing takes place in the nucleus but also could take place in the cytoplasm and the mitochondria. After the splicing of the introns, the G-caps and the Poly A tails remain on the mRNA.

A single gene can code for multiple proteins by alternative splicing. A single strand was found to be coding for twenty different proteins, depending on how the exons are assembled. Different splicing combinations are regulated in a tissue-specific manner.

Most of the transcribed DNA are introns. ninety-nine percent of the information contained in the gene transcript is destroyed when the introns are eliminated since exons are only translated. Most genes have introns. Only a hand full of organisms is found without introns. Larger eukaryotes tend to have bigger and more numerous amounts of introns compared to smaller eukaryotes.

There are a sequence of nucleic acids at the exon-intron junction of mRNA allowing intron splicing., From what is known there is a GU at the 5¹ splice site and AG at the 3¹ splice site for most genes.

This is called the GU-AG rule Splicing enzymes recognize these sites with the help of ribonucleoprotein called snRNP or snurps. snurps are formed by small nuclear RNA fragments of less than three hundred nucleotides called snRNAs. As an RNA molecule is being transcribed, four snurps attach to it combining into a large spliceosome. The abundant snRNAs catalyze the cutting and the splicing of the gene.

In a self splitting intron, the hairpin structure brings the ends o the introns near to the branch point. Then the introns themselves catalyze the making of the loop joining the two exons. The difference between self-splicing intron and one which requires the spliceosome is that the non-self splicing introns can split any introns, almost any size.

This helps the organism to survive mutations. When a mutation forms, sometimes the self-splicing introns lose their hairpin structure not allowing themselves to be spliced off.

Help Us Fix his Smile with Your Old Essays, It Takes Seconds!

-We are looking for previous essays, labs and assignments that you aced!

Related Posts

Nucleic acids are the storage form of all the genetic information required by a cell.&hellip

DNA or nucleic acid is a polymer of nucleotides. A nucleotide consists of 3 molecules:&hellip

Mutations are errors made in the DNA sequence than can have negative side effects, no&hellip

DNA: DNA is the genetic information or blueprint of who and what we are, and&hellip

All living things have DNA: the chemical instructions on how to make a living thing.&hellip

Author: William Anderson (Schoolworkhelper Editorial Team)

Tutor and Freelance Writer. Science Teacher and Lover of Essays. Article last reviewed: 2020 | St. Rosemary Institution © 2010-2021 | Creative Commons 4.0

Collective term for both exons and introns - Biology

I n the 1960s non-bacterial (eukaryotic) ribosomal RNAs (rRNAs) were found to be synthesized as a long precursor RNA which was subsequently processed by the removal of apparently functionless internal " spacer " sequences. Since bacterial (prokaryotic) rRNAs were more compactly organized, it was reasonable to ask whether the first rRNAs to evolve had the spacer sequences, which subsequently decreased in prokaryotes, or whether the spacer sequences were later acquired in eukaryotes.

  • What function(s), if any, did introns have?
  • Were introns "early" or "late"?

If introns could be dispensed with in bacteria, then perhaps they had no function. Alternatively, whatever function introns had, either was not necessary in bacteria, or might be achieved in other ways by bacteria. Since members of many bacterial species appeared to be under intense pressure to streamline their genomes to facilitate rapid replication, if it were possible they would have dispensed with any preexisting introns and/or would have been reluctant to acquire them. On the other hand, if introns played a role and/or did not present too great a selective burden, eukaryotes would have tended to retain preexisting introns, or could have acquired them.

Knowing the function of introns seemed critical for sorting out these issues. There were many ingenious suggestions. Some thought introns were just another example of the apparently non-utile " junk " DNA which littered the DNA of many eukaryotes. However, some principles to guide investigation of a possible error-checking role were presented ( Forsdyke 1981 ), and there is now growing evidence that introns play such a role ( Forsdyke 1995a,b ), although the mechanism may be somewhat different to that originally proposed ( Liebovitch et al. 1996. Biophys. J . 71, 1539-1544 ). It appears that the order of bases in nucleic acids might have been under evolutionary pressure to develop the potential to form stem-loop structures which would facilitate " in-series " or " in-parallel " error-correction by recombination.

Although the genetic code is degenerate (more than one codon per amino acid so that there is some flexibility as to which base occupies a particular position), there is still room for conflict between the " desires " of a sequence to encode both a protein (or non-messenger RNA) and stem-loop potential. The conflict would be particularly apparent in the case of genes under very strong positive phenotypic (Darwinian ) selection, as in the case of genes affected by " arms races " with predators or prey.

For example, snake venom may decrease the rodent population (prey) until a venom-resistant rodent line develops and expands. Now, while the rodent population increases, the snake population (predators) decreases, because it cannot obtain sufficient food. This decrease continues until a line of snakes arises with more active venom, which overcomes the resistance. This line of snakes now increases, and the rodent population begins to fall again. This cycle constitutes an " arms race ", and influences particular genes. The part of the venom protein which is important for toxicity is required to change so very rapidly in response to this strong phenotypic pressure from the environment, that the corresponding gene can no longer afford the " luxury " of trying to encode both the best protein and the best stem-loops. So the stem-loop role is left to the introns. Here, paradoxically, sequence conservation is high , whereas in the exons, sequence conservation is low ( Forsdyke 1995b ). Similar pressures may be acting of the peptide binding regions of the genes encoding major histocompatibility antigens ( Forsdyke 1996b ).

Subsequent to the publication of the latter papers, another " player " in the conflict between protein-encoding potential and stem-loop potential emerged. Most mRNAs are "purine-loaded" in the loop regions of stem-loop structures ( reviewed in Forsdyke & Mortimer, 2000 ). The selection pressure for this appears to operate primarily at the cytoplasmic level. Consequently, the purine-loading may not optimally serve the postulated genomic role of stem-loops. There may be a conflict between "AG-pressure" (the pressure to purine-load) and stem-loop pressure. To resolve this, stem-loop potential would be moved to a region where AG-pressure does not operate, the introns.

Scherrer, K. et al. (1970) Nuclear and cytoplasmic messenger-like RNA and their relation to the active messenger RNA in polyribosomes of HeLa cells. Cold Spring Harb. Symp. Quant. Biol . 35, 539-554.

The Journal of Theoretical Biology (1981) 93, 861-866 [ With the permission of the copyright holder, Academic Press]

(Received 17 December 1980, and in revised form 28 May 1981)

Summary . The probability of the accurate transmission of a message sequence can be increased by the addition of non-message sequences which permit errors in the message sequence to be detected and corrected. It is proposed that sequences in introns (or in other non-message genomic regions) serve this function with respect to the transmission of genetic information.

In eukaryotes the sequence of DNA bases coding for a protein is often found to be interrupted by sequences of bases (introns) which show no obvious relationship to the coding sequence (Gilbert, 1978). Speculation on the possible role of introns has included the view that they are examples of " junk " or " selfish " DNA, which does not contribute positively to cell function (Doolittle & Sapienza, 1980 Orgel & Crick, 1980). However, the notion of message sequences interrupted by non-message sequences is quite familiar to those working on noise affecting signal transmission in electrical systems. In these systems the non-message sequences have an error-checking function and permit the receiver to detect and correct errors in the message sequence (Hamming, 1980). Some principles which may guide investigations of a possible error-checking role for introns are outlined in this paper.

2. Error-correction In-parallel

The linear sequence of DNA bases consists of the purines (R, adenine and guanine) and the pyrimidines (Y, thymine and cytosine). When suggesting the duplex structure of DNA, Watson & Crick (1953) pointed out that the sequence of one strand could be derived from the parallel strand, given the algorithm that purines pair only with pyrimidines (adenine with thymine and guanine with cytosine). Thus, a duplex with correct base-pairing can be written:

If there were " noise " due to an abnormal base ( Z ) in one strand of the duplex, then the error could be corrected using information provided by the parallel complementary strand.



In this example the error-checking system recognizes Z as abnormal so that it is excised. The correct base is then inserted. However, if the noise were due to a normalbase in the wrong position, then the error-checking system would not know which strand contained the correct sequence. There would only be a 50% probability of correcting the error.


Replace base in top or bottom strand, resulting in either a corrected sequence


In this example the error-checking system recognizes that R-with-R pairing is incorrect and changes either the R in the top strand (error corrected ) or the R in the bottom strand (error compounded ). In a diploid organism containing two homologous parallel copies of duplex DNA it should be possible to correct noise due to a normal base in the wrong position with 100% probability of success.


+ Homologous duplex



In this example the error-checking system determines which of the two strands contains the error by comparing the duplex with the error with the homologous duplex. However, if the error were due to a switch in a base pairso that both strands of the duplex were affected without infringement of the purine-with-pyrimidine pairing rule, then comparison with a homologous duplex would permit only a 50% probability of error correction.





In this example, the error-checking system determines that there has been an error and at what part of the sequence it has occurred. The error-checking system cannot determine which of the homologous duplexes contains the primary error. A solution to the problem would be to bring in further homologous duplexes from other cells (e.g. by sexual mating or by fusion of somatic cells). The extraneous duplexes could then be compared with the cell's own duplexes.

Another solution would be to have homologous duplexes in-series in the cell's own DNA as suggested by Callan (1967) in his " master-slave gene " hypothesis. Callan assumed that the in-series sequences would be identical and that the continuity of the sequence coding for a particular protein would not be interrupted. An explanation of introns in these terms needs to explain why introns interrupt message sequences and why the sequence of an intron is so different from the exons surrounding it?

The former question can be answered by considering the structure of information systems. A system of information, such as the paragraph you are now reading, has a structure which imposes itself upon the message sequence. Thus a sentence is interrupted at the end of a line and continues on the next line. It is possible that the proposed error-checking system is structurally constrained so that error-checking sequences must appear in certain positions. This structure would also determine which parts of the DNA sequence are checked by a particular error-checking sequence. Thus the exons next to a particular intron need not be checked by that intron. This could explain why intron sequences are so different from neighbouring exons. Another explanation would be that the algorithm relating intron sequences to exon sequences is different from the classical base-pairing algorithm (Watson & Crick, 1953). An example can be derived by analogy with error-checking codes in electrical systems.

One of the simplest error-correcting codes uses a parity check algorithm to detect error (Hamming, 1980). Thus, if for purine we score 1 and for pyrimidine we score zero, a triplet of three bases can be summed with a fourth checking base in order to maintain even parity.

Check base
Check base

The check bases could be collected together and the error-checking system would be able to apply each of them to the appropriate part of the message sequence.

Detection of non-parity would show that one of the four bases (three message and one check base) was incorrect.

To localize the error more precisely, the message could be spread in a two-dimensional matrix (say over the surface of a nucleosome) and two sets of checking bases could be applied.


The message sequence is shown here coiled up and the two checking sequences are arranged vertically and horizontally. There is no requirement for colinearity between message sequences and checking sequences. An error, say in the central of the nine message bases (Y to R), would register with the central base of each checking triplet and so would be precisely located. The ratio of checking bases to message bases would be 2/3.

4. Operation in vivo

A system similar to this could exist in eukaryotes for preventing purine-pyrimidine interchanges (transversions), the consequences of which are likely to be more detrimental than purine-purine and pyrimidine-pyrimidine interchanges (transitions). However, eukaryotes have probably evolved a more complex system capable of correcting both transversions and transitions.

It is not implausible that DNA molecules could be arranged in cells to produce geometrical arrangements of bases which could be read against checking sequences. Enzymes exist which can literally tie DNA into knots (Cozzarelli, 1980 Liu, Liu & Alberts, 1980). The function of many DNA-binding proteins is unknown, but at least some of these are involved in error-recognition and correction (Kornberg, 1980).

There are a number of differences between prokaryotes and eukaryotes in the organization of DNA which would be explained if the latter had evolved a superior error-checking system associated with nucleosomes. Prokaryote DNA polymerases have the ability to " proof-read ", yet most eukaryotic DNA polymerases do not have this ability (Kornberg, 1980 [More are now known to have this property] ). Prokaryotes do not have a variety of histones, which are the major protein components of nucleosomes (Hubscher, Lutz & Kornberg, 1980). The histones form a core and DNA is wrapped around the core (Laskey & Earnshaw, 1980). According to the geometry of the nucleosome, the winding of the DNA should allow in-series DNA sequences to approach each other, thus providing an opportunity for an error-checking system to operate.

In the transversion-detecting model discussed above, a set of four bases would be read in relationship to each other to check that they form one of a subset of acceptable base patterns. Mutations occurring in either checking bases or message bases would be recognized and then, if possible, corrected. Fixation of a mutation in the genome would normally require a concomitant change in two bases of a set. Thus error-checking and message bases would coevolve. Information in introns (or other non-message, error-checking sequences) would coevolve with information in exons. Since the model does not require that an exon be checked by contiguous introns, the latter might evolve at a different rate from surrounding exons (Konkel, Maizel & Leder, 1979).

The complex of an exon and its checking intron(s), perhaps located in widely separate parts of the genome, would comprise a unit, permanent damage to any part of which would damage the exon. These information units could be large in organisms with large amounts of DNA and each unit would constitute a large target for mutagenic agents. Thus it would be expected that the mutation rate per locus would increase as the ratio of total genomic DNA to exon DNA increases. Studies of mutagenesis by X-rays and ethyl methane sulphonate are consistent with this (Heddle & Athanasiou, 1975).

This paper has presented some principles to assist studies of a possible error-correcting role of introns. It should not be difficult to search for algorithms relating intron sequences to exon sequences. A central computer bank of DNA sequences can be readily accessed (Dayhoff et al., 1980 Gingeras & Roberts, 1980). Rapid computerized decoding techniques have been developed for military purposes. The discovery of such algorithms would justify a search for mechanisms by which exon sequences could be checked against intron sequences.

The author thanks Dr Keith Randle of the Royal Military College of Science, Shrivenham, for drawing attention to error-correcting codes, and Dr Stafford Tavares of Queen's University, for helpful discussion.

CALLAN, H. G. (1967). J. Cell Sci . 2, 1.

COZZARELLI, N. R. (1980). Science207, 953.

DAYHOFF, M. O., SCHWARTZ, R. M., CHEN, H. R., HUNT, L. T., BARKER, W. C. & ORCUTT, B. C. (1980). Science209, 1182.

DOOLITTLE, W. F. & SAPIENZA, C. (1980). Nature, Lond . 284, 601.

GILBERT, W. (1978). Nature, Lond . 271, 501.

GINGERAS, T. R. & ROBERTS, R. J. (1980). Science209, 1322.

HAMMING, R. W. (1980). In: Coding and Information Theory . pp. 21-49. Englewood Cliffs: Prentice-Hall.

HEDDLE, J. A. & ATHANASIOU, K. (1975). Nature, Lond . 258, 359.

HUBSCHER, U., LUTZ, H. & KORNBERG, A. (1980). Proc. natn. Acad. Sci. U.S.A . 77, 5097.

KONKEL, D. A., MAIZEL, J. V. & LEDER, P. (1979). Cell18, 865.

KORNBERG, A. (1980). In: DNA Replication . pp. 101-229. San Francisco: Freeman & Co.

LASKEY, R. A. & EARNSHAW, W. C. (1980). Nature, Lond . 286, 763.

LIU, L. F., LIU, C. C. & ALBERTS, B. M. (1980). Cell19, 697.

ORGEL, L. E. & CRICK, F. H. C. (1980). Nature, Lond . 284, 604.

WATSON, J. D. & CRICK, F. H. C. (1953). Nature, Lond . 171, 263

I n 1985 when considering gene duplication, Russell F. Doolittle pondered ( Trends in Biochem. Sci . June ):

"I still favor the notion that the principal selective value of introns concerns genetic maintenance, as was suggested by many when introns were first discovered. It is the preventing of mismatches of duplicated genes during meiosis and mitosis that is so crucial to eukaryotes. The free and easy nucleotide substitution that occurs in introns should serve as a buffer against [chromosome] mispairing while allowing the advantages of sets of similar gene products. Sets of multiple gene products allow adaptation for regulation of different parts of the life cycle or for different tissues."

Doolittle's focus here is on the prevention of pairing between the sequences of duplicate genes. By preventing pairing, potential recombination events would be discouraged. Since duplicate genes have similar sequences, and recombination can occur when sequences are similar, then the introduction of introns that are more able to accept mutations (i.e. to diversity), would serve to preserve the duplicate genes (i.e. prevent them blending by recombination). This theme was extended to involve differences in GC% and isochores by Matsuo and coworkers, and quotations from their important 1994 paper are here reproduced with the copyright permission of Walter Schaffner.

I ntrons are usually more highly diversified than neighbouring exons. From detailed studies of POU domain gene families, in a 1994 paper Matsuo and coworkers were led to suggest that "diverged introns would reduce the length of sequence that is common in two DNA segments, resulting in a dramatic reduction of the efficiency of homologous recombination." Thus, the diversification would serve to decrease intragenomic recombination between similar genes. "This would help to explain a long-known paradox, namely that most introns are preserved [between species] even though their actual sequence hardly seems to matter." See paper entitled:

" Short Introns Interrupting the Oct-2 POU Domain May Prevent Recombination between POU Family Genes without Interfering with Potential POU Domain 'Shuffling' in Evolution ," by Koichi Matsuo, Oliver Clay, Patrik Kunzler, Oleg Georgiev, Pavel Urbanek, and Walter Schaffner Biological Chemistry Hoppe-Seyler 375, 675-683.

The authors extrapolated this principle to exons, where third codon positions, generally free from protein-encoding constraints, can also play a role in antirecombination:

"Another 'line of defence' [against homologous recombination] may be the overall DNA sequence composition and especially synonymous codon choice. We have compared all the available cDNA sequences of POU domains and found that even in cases where the amino acid sequence is highly similar, different codons tend to be used between POU domain factors within the same species. . The G + C content of the Oct-2 POU domain DNA is high, while that of the OCT-1 POU domain DNA is intermediate. . Finally, the Pit-1 POU domain is A + T-rich. . The third codon base conservation between some orthologous genes in different species (such as the Oct-2 gene in human and mouse) may thus reflect a constraint to avoid recombination with related sequences elsewhere in the genome, rather than a direct selection, say for codon usage or mRNA structure. One possible way to maintain a codon bias would be to deposit members of a gene family in different so-called isochores, which are defined as long (>300 kb) genomic DNA segments of a characteristic G + C content . ."

This suggests a discordance with the general theme of this web-page, which relates to nucleic acid structure. However, structure is exquisitively sensitive to small GC% differences ( Click Here Forsdyke, 1998). Thus, it is proposed that GC% differences cause structural differences which, in turn, impede recombination (see below).

Letter to the Editor . [Declined circa 1994 by Nature . Published by the author on Bionet.]

SIR - In his New & Views item entitled "The uncertain origin of introns" 1 Laurence Hurst presents some of the arguments for " introns early " (the Gilbert school 2 ) and " introns late " (the Stoltzfus school 3 ). Both schools seem not to have noticed that introns interrupt both coding and non-coding parts of genes 4 .

It has long been known that genes for rRNAs and tRNAs contain interruptions, but these may be special cases. Recently, however, " mRNAs " have been discovered which have no protein product . The corresponding genes look like most protein-encoding genes, and possess multiple introns 5 . Thus, introns interrupt genetic information , not just protein-encoding information . It is not too surprising then, that it is difficult to associate exons with domains of protein structure or function 2,3 . It does not follow that this disposes of the introns early viewpoint. There may other exon theories of genes, as well as " the " exon theory of genes.

Evidence for an alternative exon theory of genes has come with the discovery that there has been a genome-wide pressure on sequences to accept mutations favouring stem-loop formation 6 . Adapting the procedure of Le and Maizel 7 , long sequences from a variety of species were divided into 200 nt windows, each of which was submitted to a computer program which determines the minimum free energy of folding into stem-loop structures. The sequence in each window was then shuffled and again subjected to the folding program. It was found consistently that the folding of natural sequences is energetically more favourable than the folding of the corresponding shuffled sequences. This finding that stem-loop-forming potential is widely dispersed in genomes is consistent with growing evidence that the " kissing " between loops at the tips of stems on homologous chromosomes is rate-limiting in recombination 8,9 . Mutations which favour recombination may affect either the enzymes involved in recombination, or their substrate, DNA itself (hence stem-loops).

First synonymous codons were used so that a sequence could at the same time both optimize its folding propensity and encode a protein.

If this failed, then conservative amino acid exchanges were accepted to widen the range of codon choice without impairing protein function.

Finally, if these failed, the protein was permitted only to evolve in segments interrupted by regions of high stem-loop-forming potential.

Remarkably, traces of this primitive arrangement can be discerned in some modern genes (D.R.Forsdyke, unpublished work 6 ). In the compact genome of C. elegens stem-loops are abundant and 43% of these occur in introns, which represent only 20% of the genome 12 . Detailed models for the role of stem-loops in recombination are available 13-16 .

Department of Biochemistry,

Kingston, Ontario, Canada K7L3N6

1. Hurst,L.D. Nature 371, 381-382 (1994).

2. Gilbert,W. & Glynias,M. Gene 135, 137-144 (1994).

3. Stoltzfus et al. Science 265, 202-207 (1994).

4. Hawkins, J.D. Nucleic Acids Res. 16, 9853-9905 (1988).

5. Pfeifer, K. & Tilghman, S.M. Genes Devel. 8, 1867-1874 (1994).

6. Forsdyke, D.R. FASEB.J. 8, 1395A (1994).

7. Le, S-Y. & Maizel, J.V. J. Theor. Biol. 138, 495-510 (1989).

8. Kleckner, N., Padmore, R. & Bishop, D.K. Cold Spring Harbour Symp. Quant. Biol. 56, 729-743.

9. Kleckner, N. & Weiner, B.M. Cold Spring Harbour Symp. Quant. Biol. 58, 553-565.

10. Joyce, G. F. & Orgel, L. E. The RNA World, 1-25 (Cold Spring Harbour Laboratory Press, New York).

11. Bernstein, C. & Bernstein, H. Aging, Sex and DNA Repair, (Academic Press, San Diego).

13. Sobell, H.M. Proc. Natl. Acad. Sci. USA 69, 2483-2487 (1972).

14. Wagner, R.E. & Radman, M. Proc. Natl. Acad. Sci. USA 72, 3619-3622 (1975).

15. Doyle, G.G. J. Theor. Biol 70, 171-184.

A Stem-Loop "Kissing" Model for the Initiation of Recombination and the Origin of Introns

D. R. Forsdyke (1995a) Molecular Biology and Evolution 12, 949-958.

[JAN KLEIN, reviewing editor Received February 18, 1995 Accepted May 3, 1995 . Reproduced with copyright permission from the Editor, Simon Easteal]

Key words : recombination, stem loop, homology search, introns early, (G+C)/(A+T) ratio, speciation.

Summary . Mutations which improve the efficiency of recombination should affect either the proteins which mediate recombination or their substrate, DNA itself. The former mutations would be localized to a few sites. The latter would be dispersed . Studies of hybridization between RNA molecules have suggested that recombination may be initiated by a homology search involving the " kissing " of the tips of stem loops. This predicts that, in the absence of other constraints, mutations which assist the formation of stem loops would be favored . From comparisons of the folding of normal and shuffled DNA sequences, I present evidence for an evolutionary selection pressure to distribute stem loops generally throughout genomes . I propose that this early pressure came into conflict with later local pressures to impose information concerning specific function. The conflict was accommodated by permitting sections of DNA concerned with a specific function to evolve in dispersed segments. Traces of the conflict seem to be present in some modern intron-containing genes. Thus, introns may have allowed the interspersing of selectively advantageous stem loops in coding regions of DNA.

The possession of efficient mechanisms for homologous recombination would appear to be evolutionarily advantageous (Bernstein and Bernstein 1991). Thus, mutations in DNA which improve the efficiency either of the proteins concerned with recombination, or of the DNA to act as a target for these proteins, would be evolutionarily advantageous. Mutations improving the efficiency of proteins would be expected to be localized to the region of the corresponding genes. However, mutations improving the ability of DNA to act as a recombination substrate would be expected to be widely distributed .

The view that prior formation of a synaptonemal complex is essential for the initiation of meiotic recombination has recently lost ground to the view that an initial sequence-based homology search precedes chromosome pairing (Hawley and Arbel 1993). In 1971 Crick advanced the " unpairing postulate" that supercoiled duplex DNA in chromatin would form double-stranded stem-loop structures. At the tips of these structures individual strands would unpair, providing the opportunity for interactions with single-stranded DNA at the tips of similar stem loops on other chromosomes. This would allow a homology search and the pairing of homologous chromosomes during meiosis. There was no clear prediction that the primary sequence could have evolved to facilitate this.

Subsequent studies of the mechanism of hybridization between the sense and antisense RNAs involved in regulation of the replication of plasmid ColE I indicated that an initial homology search between the complementary single-strandedRNAs involves weak, reversible, transient " kissing " interactions between the tips of stem-loop structures (Tomizawa 1984). Supporting evidence for a role of stem loops in recombination between RNA genomes was provided by Romanova et al. (1986). This raised the possibility that similar stem-loop structures, generated when duplex DNA adopts a cruciform configuration, are involved in the initial homology search during meiosis (Sobell 1972 Wagner and Radman 1975 Kleckner and Weiner 1993).

If stem-loop formation in duplex DNA facilitates the initiation of homologous recombination, then there should have been a mutational pressure to produce dispersed sets of complementary oligonucleotide pairs . It is known that the frequencies of members of complementary oligonucleotide pairs are similar in DNA (Pradhu 1993 Forsdyke 1995b). However, it is not known whether the complementary pairs are sufficiently colocalized to permit extensive stem-loop formation.

An algorithm for the computation of " statistically significant " stem-loop potential was introduced by Le and Maizel (1989). This involved comparison of the folding into stem-loop structures of small windows in nucleic acid sequences, with the folding of the same windows after randomizing base order. In this paper, their approach is adapted for the determination of the distribution of stem loops in long DNA molecules. I first examine the distribution of stem-loop potential in some of the long DNA sequences (e.g., 68 kb) which have recently become available. The results support the single-stranded DNA stem-loop " kissing " model. I then speculate that an early genome-wide evolutionary pressure for the formation of dispersed stem-loop structures came into conflict with later local pressures to encode specific functions affecting phenotype . Finally, I present a survey of a variety of modern genes to see if traces of the conflict are still discernible. The results cast a new light on both the exon/intron problem (Gilbert and Glynias 1993 Stoltzfus et al. 1994) and the problem of why different genomes or genome compartments have distinct (G+C)/ (A+T) ratios (Wyatt 1952 Bernardi 1989). The latter is discussed more fully elsewhere (unpublished data [see Forsdyke 1996a] ).

The secondary structure of DNA in single-stranded form is very sensitive to small changes in sequence (Orita et al. 1989 [ see footnote ] ). Provided the length is not excessive, any such sequence can be analyzed using computer programs such as FOLD (Zuker 1989) to arrive at a theoretical optimum secondary structure of minimum free energy. Because of the greater strength of GC bonds relative to AT bonds, a GC-rich sequence tends to have a more stable structure than an AT-rich sequence of the same length. However, it does not follow that the most stable structures are likely to be of most local functional relevance. The bases in a segment might show poor complementarity, but if the few complementary pairs were GC pairs, the stability of the folded molecule might be quite high. Base composition is a genome , or genome sector (isochore), " strategy ," which has a major influence on codon choice, particularly of synonymous codons (Grantham et al. 1980). Base composition is not a local " strategy ." Codons are a local " strategy ."

Confidence that a given secondary structure is of local functional relevance is greater if it can be shown that the sequence has accepted mutations which enhance stem-loop formation (i.e., that the actual sequence of bases, in addition to base composition, has contributed to secondary structure stability). In principle, this should be possible by comparing the folding of a natural sequence with that of randomized versions of the same sequence.

A natural sequence is but one member of a large set of possible sequences with the same base composition. The averagecharacteristics of this set can be arrived at by randomizing the natural sequence. Randomization (shuffling) destroys information present in the primary sequence (base order), without changing base composition or sequence length. Thus, provided length is kept constant, average characteristics reflect base composition alone.

Programs of the Genetics Computer Group Inc. (Gribskov and Devereux 1991) were made available online through the services of the Molecular Biology Data Service of the National Research Council, Ottawa, which include access to a Silicon Graphics Challenge XL computer. Sequences were randomized using the program SHUFFLE. The outputs from the latter could be used directly by the program FOLD (Zuker 1989), which finds a secondary structure of minimum free energy using the energy values for base stacking and loop destabilization assigned by Turner et al. (1988). This program was designed for the study of secondary structure in non-supercoiled, single-stranded RNA molecules. It is assumed here that it is applicable to DNA. This assumption is not entirely valid. For example, the 2' hydroxyl group in RNA can contribute of the order of I kcal/mol to helix stability (Turner and Bevilacqua 1993). Since the present study is mainly concerned with differences in fold energies rather than absolute values, differences between RNA and DNA are considered of minor importance. A unix script program, SHUFFOLD, was written to determine the minimum free energy of folding of natural and shuffled versions of successive overlapping 200nt windows from nucleic acid sequences. Another program, STATS, was written to subject the output from SHUFFOLD to statistical analysis in the Minitab system (Ryan and Joiner 1994).

For each 200-nt window, FOLD first determines the minimum free energy value for folding of the natural sequence (FONS value). This is a function of both base composition and base order and measures the " total stem-loop potential " of a region. Then 10 random sequences are generated from the same window, and each randomized sequence is submitted to FOLD. The mean minimum free energy value for the 10 sequences (FORS-M value) provides a measure of the contribution of base composition alone to the stem-loop potential (" base composition-determined stem-loop potential "). Since the FONS value is usually more negative than the FORS-M value, the difference between the two values (FORS-M less FONS) is usually positive.

This difference (the "FORS-D" value) provides a measure of the contribution of base order alone to the stem-loop potential. Thus, a positive FORS-D value defines and quantitates the " base order-determined stem-loop potential ." This closely corresponds to the " segment score " of Le and Maizel (1989), which is used to assess " statistically significant " stem-loop potential. A negative FORS-D value in a region may mean that base order has been adapted to serve some other potential.

FIG. 1 .- Fold energy minimization values (FORS-M, FONS) and differences (FORS-D) for the 68-kb GenBank sequence HUMMMDBC. Nonoverlapping windows (200 nt) were selected at 1,000-nt intervals (i.e., 1-200, 1,001-1,200, 2,001-2,200, etc.). Secondary structure energy minimization values, determined using the program FOLD, were obtained for each window in the natural sequence (FONS values). Each 200-nt sequence was then subjected to 10 independent randomizations, and FOLD values for each of the 10 randomized sequences (FORS) were determined. In B the mean FOLD value for each set of 10 randomized sequences (FORS-M) is plotted with the corresponding FONS value. In A the differences between the FORS-M values and the corresponding FONS values are plotted (FORS-D values + standard error). FORS-M and FONS values are both negative. FORS-D values are positive whenever the FONS value is more negative (i.e., corresponds to a higher folding energy) than the FORS-M value. Each data point is at the middle of its 200-nt window.

Positive FORS-D Values Are Dispersed in Genomic DNA

A study of folding energies was carried out on one strand of the 68-kb human chromosome 19 segment HUMMMDBC (GenBank name). If there had been a genome-wide pressure to maximize the potential to form secondary structures, then positive FORS-D values would be expected . To get a general impression of the folding propensity, the FOLD program was applied to isolated 200-nt windows, at 1-kb intervals. The FONS and FORS-M profiles tend to follow each other (fig. I B). Positive FORS-D values are widely dispersed and greatly exceed negative FORS-D values (fig. 1A average 4.37+0.90 kcal/mol). Similar results are obtained for the HUMHBB segment from chromosome 11 (average 4.49+1.34 kcal/mol for 15 windows at 5-kb intervals) and for the HUMHDABC segment from chromosome 4 (average 3.59+1.62 kcal/mol for 12 windows at 5-kb intervals). Thus positive FORS-D values are widely dispersed in segments from small, medium, and large human chromosomes . Long DNA fragments from Drosophila melanogaster, Escherichia coli, and bacteriophage lambda also showed significant positive average FORS-D values (Forsdyke 1995b).

The dispersal of positive FORS-D values throughout genomes indicates a general evolutionary pressure on DNA molecules to accept mutations which would promote stem-loop formation by affecting base order. This supports the stem-loop " kissing " model for recombination as outlined in the Introduction. FORS-D profiles show negative values in some regions, suggesting conflicts between a general pressure on a gene to optimize its folding propensity and local pressures for some function.

FIG. 2 .-Genome-wide and local evolutionary forces acting on a genome. The two upper rows of dispersed downward-pointing arrows symbolize two genome-wide forces which influence the potential to form stem-loop structures. The first row of arrows symbolizes the pressure to adopt a particular species-specific ( C+G)/(A+T) ratio ("CG/ AT pressure"). This provides the base composition-determined component of the stem-loop potential. A pressure acting to increase the ratio would favor stem-loop formation. A pressure acting to decrease the ratio would diminish stem-loop formation. The equilibrium between the two latter pressures is determined by interactions with other species (and incipient species) in the same environment (see Discussion). The second row of arrows symbolizes a pressure to increase stem-loop formation by favoring the formation of inverted repeats with complementary oligonucleotide sequences. This provides the base order-determined component of the stem-loop potential and may be quantified as the FORS-D value. The lower row of upward-facing paired arrows symbolizes pressures for the encoding of specific functions in local regions of the genome (e.g., protein-mediated functions).

Potential conflicts between general and local pressures on evolving DNA molecules are summarized in figure 2. The two rows of downward-pointing arrows symbolize pressures acting throughout the genome. One of these is the general pressure to set the (C+G)/(A+T) ratio (base composition) at a particular level in a particular genome or genome segment ("CG/AT pressure" see Discussion). The second is the pressure, quantified as the FORS-D, to generate sequences with complementary oligonucleotides situated so as to favor the formation of stem loops. These two sets of arrows are pointing in the same direction, indicating that there is little conflict between them. The upward-pointing arrows are in distinct regions, symbolizing localized evolutionary pressure for the encoding of specific function. Here there is a conflict. A sequence required to encode a protein might not simultaneously be able to optimize its folding propensity.

FIG. 3 .-Fold energy minimization values (FORS-M, FONS) and differences (FORS-D) for A, the 4,102-nt sequence containing the G0S19-1 cytokine gene (GenBank name HUMG0S19A), and B, the corresponding cDNA. Each data point corresponds to the middle of a 200-nt window. Each window overlaps the preceding window by 150 nt, except that the first of the 41 overlapping windows of the genomic sequence spans nt 1- 94. Thus windows, in order, correspond to nt 1- 1 94, nt 45-244, nt 95-294, and so forth. This determines that the beginning of exon I corresponds to a new window (nt 1,995-2,194). In B there are 14 overlapping windows which begin with nt 1. The last window of the 778-nt cDNA sequence spans nt 601-778. The three exons are shown as open boxes. Vertical dashed lines in A indicate, from left to right, the beginning of exon 1, the beginning of the protein-encoding region, the end of the protein-coding region, and the end of exon 3. Vertical dashed lines in B show where introns have been removed.

FORS-D Values of Intron-Containing Genes

To seek traces of this postulated evolutionary conflict between a genome-wide FORS-D pressure and local pressures related to individual gene function, the folding patterns of some modern genes were examined. Fold energy plots were prepared using consecutive 200-nt windows with 150-nt overlaps.

The human cytokine-encoding gene G0S19-1 is one of a series of potential lymphocyte G0/G1 switch regulatory genes (" G0S genes ") which have been sequenced in my laboratory (Blum et al. 1990 [ Now more widely known as MIP1-alpha, whose receptor is a coreceptor for HIV-1 ]). The gene has three exons distributed over a region of 1,886 nt, and the cDNA derived by splicing is 778 nt in length [ See diagram at top of page ]. As noted above (fig. 1), FONS and FORS-M profiles tend to resemble each other (fig. 3A, lower panel). FORS-D values are generally positive in the gene however, distinct local decreases occur in certain regions (fig. 3A, upper panel). When the introns are removed to generate the cDNA, decreased FORS-D values tend to be concentrated together, and general positivity is less evident (fig. 3B, upper panel). An extremely low FORS-D value in the immediate 5'flank of the gene corresponds to a region rich in potential regulatory motifs, which is highly conserved between G0S19-1 and its murine homolog (Russell and Forsdyke 1993).

This supports the idea of a conflict between a general evolutionary pressure for stem-loop potential and local pressure for function. The very high positive FORS-D values (>20 kcal/mol) in the 5'flank of the G0S19-1 gene correspond to a potentially " foreign " AT-rich minisatellite-like element containing four 22-nt tandem repeats and two inverted repeats. (Only 1 of the 22-nt repeats is present in the G0S19-2 gene. Thus an amplification/contraction probably occurred following the divergence of the two genes Blum et al. 1990.) Low FORS-D values in the regions of the second and third exons are also consistent with the conflict hypothesis however, values are high in the signal peptide-encoding first exon (fig. 3A).

FIG. 4 .- Fold energy minimization values (FORS-M, FONS) and differences (FORS-D) for the 6,210-nt sequence containing the human c-fos gene (GenBank name HUMFOS). Details are as in fig. 3.

Another G0S gene (G0S7) corresponds to the oncogene c-fos, which has four exons (fig. 4). Each exon is associated with a region of negative FORS-D value, whereas the three introns tend to have high FORS-D values. The negative values of the exons are not reflected in the cDNA, which shows no consistent tendency to low values (fig. 4B). The coding part of the fourth exon is long (642 nt). Although the 5' part of the exon is associated with a region of low FORS-D value, the rest of the coding region is associated with high FORS-D values. Thus, here there appears to be no conflict between pressures for the evolution of protein-encoding function and for the potential to form stem loops. The 3' part of the 3' non-coding region has relatively low FORS-D values, consistent with the known functional role of this region in controlling mRNA stability.

FIG. 5 .- Linear regression analysis of the FORS-D value against the degree of exon overlap(%) for A, G0S19-1 B, c-fos and C, p53. Each data point represents values for a 200-nt window. Each window overlaps its neighbors by 150 nt. For G0S19-1, windows extend from the beginning of the first exon to the end of the last exon. Windows in the flanks are omitted except those which are close to, and hence partially overlap, the first and last exons. For reasons given in the text, terminal exons are omitted from the analysis of c-fos and p53.

FORS-D Values Correlate Negatively with Exon Overlap

To obtain an objective measure of the tendency of small exons to associate with low FORS-D values, each of the 200-nt windows for which FORS-D values were determined was scored for its percentage of exon overlap. Figure 5A shows a linear regression plot for the three exons of the G0S19-1 gene. Consecutive windows extended from the first 5'window overlapping the first exon to the last 3'window overlapping the last exon. Although there was much scatter of data points, the slope of the least-squares regression line was significantly greater than zero (P = 0.048 that the slope is not greater than zero). Figure 5B and C show similar data for the oncogenes cfos and p53, except that the long last exons were omitted. Both slopes were significantly greater than zero (P = 0.018, c-fos, P= 0.002, p53). The corresponding P values when the last exons were included were 0.752 and 0.048. (A justification for omitting last exons is given later.)

Table 1. Summary of FORS-D plots a and Linear Regression Analyses of FORS-D versus Exon Overlap b

1 2 3 4 5 6 7 8 9 10 11 12 Slope d P
Oncogene p53 HSP53G 8,750 . - + - - - + + (-) - (+) . -0.054* 0.002
Triose phosph. isomerase HSTPI1G 4,400 (-) - (+) (+) (-) (+) (+) . . . . . -0.011 0.710
Albumin HUMALBGC 8,400 (-) (-) - - - + . . . . . . -0.026 0.232
HUMCRYABA 4,200 (-) (+) (+) . . . . . . . . . +0.009 0.616
Cytochrome C HUNCYCAA 3,088 - - (+) . . . . . . . . . -0.033* 0.325
Oncogene c-fos HUMFOS 6,210 - - - (-) . . . . . . . . -0.052* 0.018
Cytokine G0S19-1 HUMG0S19A 4,100 + - (-) . . . . . . . . . -0.041 0.048
Zinc finger protein HUMG0S24B 3,135 - (+) . . . . . . . . . . -0.056* 0.354
Beta-globin HUMHBB 3,100 (-) (-) + . . . . . . . . . +0.008 0.701
Heat shock protein 86A HUMHSP86A 2,598 . + - (+) - + . . . . . . -0.007 0.754
Heat shock protein 90B HUMHSP90B 8,200 (+) - (-) + - (-) - + - (+) - - -0.038 0.001
Troponin-C HUMTROC 4,400 + + - - - + . . . . . . -0.135 <0.001
Tcp-10 MUSTCP10AA 3,882 . - - (+) + + + - (-) + - . -0.066 0.017
Glucose reg. protein TOMBIPGRP 2,966 (-) (+) + (+) (+) . . . . . . . +0.017 0.228
a Exons showing relative decreases in FORS-D values (exemplified by Figs. 3-4) are indicted by minus signs (-). Exons failing to show decreases in FORS-D values are indicated by plus signs (+). Parentheses around signs indicate some ambiguity.
b FORS-D values were plotted against degree of exon overlap (exemplified by Fig. 5). P values indicate the extent to which slopes are not significantly different from zero.
c All exons were studied except HUMALBGC (first six only), and HSP53G, HUMHSP86A, and MUSTCP10AA (not first exon.
d Values with asterisks indicate that the last exon was omitted from the analysis.

The above analyses were extended to a broader spectrum of genes, and the results are shown in table 1. Five of 12 human genes had significant P values (< 0.05 that the slope is not significantly greater than zero). The one mouse gene studied had a significant P value. The one plant gene studied (TOMBIPGRP) did not have a significant P value. The most significant P value was obtained with a gene (encoding troponin C), believed to have been under positive Darwinian selection (Ohta 1994). Table 2 compares average FORS-D values of gene-containing segments (table 1) with those of the corresponding cDNAs. In most cases average FORS-D values for the genomic segments were greater (average +0.71+0.33 P < 0.05 by Student's t-test).

Table 2. Comparison of Average FORS-D Values of Genes and Their Corresponding cDNAs
FORS-D (kcal/mol)
Oncogene p53 4.8+0.4 (172) 4.2+0.7 (47) +0.6
Triose phosphate isomerase 3.7+0.6 (85) 4.5+0.7 (22) -0.8
Albumin 2.3+0.4 (165) 0.4+1.1 (12) +1.9
alpha-B-crystallin 3.3+0.5 (81) 1.8+1.0 (10) +1.5
Cytochrome C 1.5+0.5 (59) 2.2+1.0 (22) -0.7
Oncogene c-fos 1.9+0.5 (120) 3.0+0.8 (39) -1.1
Cytokine G0S19-1 4.2+0.6 (79) 1.8+1.2 (13) +2.4
beta-globin 2.7+0.5 (59) 1.0+1.2 (10) +1.7
Heat shock protein 86A 2.2+0.6 (49) 2.5+1.1 (17) -0.3
Heat shock protein 90B 2.2+0.4 (161) 0.9+0.7 (48) +1.1
Troponin-C 4.4+0.7 (85) 2.3+1.9 (11) +2.1
Mouse TCP-10 3.4+0.6 (75) 2.6+1.0 (24) +0.8
Tomato glucose-regulated
2.3+0.5 (56) 2.2+0.6 (22) +0.1
NOTE: The average FORS-D values (+ standard error) for 200 nt windows were calculated for DNA segments containing the genes shown in table 1 and their corresponding cDNAs. Numbers of windows, each overlapping the preceding window by 150 nt, are shown in parenthesis. All genes are human except the bottom two.

The Competing Needs of the Early Replicators

The winner in the competition between replicator molecules in early evolution would have been the replicator which could most efficiently balance the needs for:

(1) rapid replication ,

(2) accurate replication , and

(3) stability .

would have been another property which successful replicators would have optimized at an early stage.

If Tomizawa's stem-loop " kissing " model is applicable to the early RNA world, then it can be further envisaged that replicators which modified their sequences to increase the probability of stem-loop formation (and hence of recombination), would have had a survival advantage. To reap the benefits of efficient recombination a replicator would have had to exchange segments with its own kind of replicator , not with other kinds. Thus need 4 implies a fifth need,

(5) the need to distinguish "self" from "not-self."

This discussion deals with potential conflicts between needs 4 and 6, and needs 4 and 5, and suggests how these conflicts might have been resolved.

Evidence for a Stem-Loop Model

It can be imagined that as part of the attempt to improve replicator stability (need 3), RNA molecules would have been superseded by DNA molecules, but these would have retained the potential to form stem loops. This leads to the prediction explored here that, where not constrained by other needs, modern DNA molecules would have a special potential to form stemloop structures. This would be particularly evident during meiosis (Kleckner and Weiner 1993) and might require an exceptional degree of supercoiling for DNA to depart from its classical duplex structure (Murchie et al. 1992). Consistent with this, inhibitors of topoisomerases, which relax supercoiled DNA, greatly enhance recombination (Wang et al. 1990). Furthermore, endonucleases involved in recombination, which make single-strand nicks in double-stranded DNA, require a supercoiled substrate (Sung et al. 1993). Although a relatively crude measure, there is evidence from studies with anticruciform antibodies for the presence of stem-loop structures, which can vary with the physiological state of the cell (Ward et al. 1990). Further evidence for a role of stem loops in recombination is presented elsewhere (Romanova et al. 1986 Reed et al. 1994).

Positive FORS-D values indicate the existence of an evolutionary pressure on DNA base order , rather than on base composition , to generate the potential to form stem-loop structures. Thus, in the absence of other constraints, base changes which enhance the formation of stem loops would have been accepted. I have reported here that dispersed positive FORS-D values are found in long human genome segments considered to consist mainly of nontranscribed intergenic DNA (fig. 1). This is found with DNA from various species, including some with much less intergenic DNA than humans (Forsdyke 1995b). The most obvious genome-wideselective force driving evolution of the potential to form stem loops is recombination.

Conflict Hypothesis and the Evolution of Introns

A potential conflict between the genome-wide FORS-D pressure and pressures on DNA to encode specific functions is apparent (fig. 2). A sequence required to encode a protein might not at the same time be able locally to optimize its folding propensity. The conflict can be met in three ways.

First, because of the redundancy of the genetic code, particular synonymous codons can be preferred.

Second, amino acids with similar functions (e.g., serine and threonine) can be interchanged to widen the range of codon choices.

Third, the sequences encoding a protein can be diffused over a wide region by permitting encoding to occur only in discrete segments.

If the first two options are not sufficient, then only the third option is left. Thus introns might correspond to parts of a gene where the constraints on the first two options are most severe. The more severe the constraints, the more introns there would be, and the longer would be the length of DNA occupied by the gene .

This conflict hypothesis might explain how introns initially arose. However, it does not appear to address the problem of how introns came to vary so dramatically in length (Hawkins 1988). Variation in intron length would tend to prevent recombination between genes with homologous exon sequences by impairing the precise register required for successful " kissing " between the loops of stem-loop structures.

In terms of FORS-D values, the hypothesis states that when exons first arose, the exon sequences themselves would have had low FORS-D values, except where, by chance, the need to encode a peptide happened not to be in conflict with the need to form stem loops. The low FORS-D exons would have been surrounded by high FORS-D introns and flanking sequences. Over evolutionary time the demands of regulation of gene expression would have been imposed upon this primitive arrangement, so that low FORS-D values might be present in regions of DNA where regulatory proteins bind (except, perhaps, where the regulatory proteins recognize DNA palindromes). Also, over evolutionary time a background " noise " might have been imposed by mobile genetic elements and recombinations accompanying exon shuffling (Gilbert and Glynias 1993).

Taking into account the proposed long evolutionary period since introns were established in genes, the analyses of FORS-D values in various intron-containing genes (figs. 3-4 table 1) do provide some support for the idea of a conflict between genome-wide and local pressures. Further evidence has been provided by recent studies of snake venom phospholipase A2 genes, retroviral genomes, and major histocompatibility complex peptide-binding genes (Forsdyke 1995a, and unpublished data [see below] ). All of these are under greater selection pressure (positive selection) than the genes studied here (with the possible exception of genes encoding troponin C Ohta 1994).

Some terminal long open-reading frames were excluded from the linear regression analyses shown in figure 5 and table 1. In terms of the hypothesis presented here, long open-reading frames would onlyexist in certain genes becausethose genes had been able to resolve the conflict in the regions where we now see the long open-reading frames FORS-D pressure would have been accommodated by the choice of appropriate synonymous codons or of amino acids with similar functions. (These would have been the main choices open to organisms without introns.) Thus, FORS-D analyses of genes with long open-reading frames are not inconsistent with the conflict hypothesis (fig. 4). By omitting unduly long exons from regression analyses, one hopes to better assess genomic regions which are less able to adapt to FORS-D pressure by using synonymous codons or functionally similar amino acids.

The conflict hypothesis would predict that cDNA sequences would tend to have lower average FORS-D values than the genes from which they derive. However, it does not follow that low FORS-D exons, when united, will alwaysresult in a low FORS-D cDNA. Two independently arising exons may each have a low potential to form stem loops but a high potential to form stem loops collectively. As a simple example, consider two exons of sequence CACACACA and TGTGTGTG. Both have low FORS-D values, but they would unite to generate a cDNA with a very stable stem loop (high FORS-D value). Only some genes show clear evidence for higher average FORS-D values than the corresponding cDNAs (G0SI 9-1, fig. 3 table 2).

Why Are FORS-D Values So Sensitive?

Evolutionary pressures (mutation, selection, drift) determine how many bases there are in a sequence (DNA length), the proportions of different bases (base composition), and the order of the bases (sequence). The experimental shuffling of the order of bases in a sequence of given length and base composition disrupts information present in the natural sequence and generates a reference sequence of the same length and base composition against which the natural sequence can be compared. Thus one can distinguish evolutionary pressures affecting base order (primary sequence) from those affecting base composition.

FORS-M and FONS profiles closely follow each other (figs. I B, 3B-4B), showing that base composition is the major determinant of the stability of stem loops . It thus might seem somewhat surprising that differences of only a few kilocalories (FORS-D values) provide such a sensitive indicator. That fluctuations in FORS-D values are not random is shown by their consistent average positivity (fig. 1) and the association of negative FORS-D values with functionally important regions of DNA (fig. 3).

Base composition is a genome"strategy" (Grantham et al. 1980) and thus is less likely to be relevant to localgene function than primary sequence. FORS-D values provide a sensitive indicator because they measure the base order -determined stem-loop potential, not the base composition -determined stem-loop potential. A close examination of folding energy profiles reveals that highFORS-D values may sometimes be associated with quite low negative FONS values (a function of base order and base composition) and even lower negative FORSM values (a function of base composition). Thus, high negative FONS values (implying a very stable secondary structure) may sometimes have little local functional relevance.

The recombination model of Wagner and Radman (1975) proposes that, after initial contact is made by loops at the tips of the stems, there are endonucleolytic cleavages of single strands of opposing stems. Single strand exchanges between the stems then follows. This process would presumably abort if precisecomplementarity were not present in the stems (Radman and Wagner 1993). Furthermore, Tomizawa (1993) has concluded that the major role of the stem in a stem-loop structure is the proper positioning of the loop. This allows the unpaired bases in the loop to pair, in register, with those of an appropriately positioned loop projecting from a complementary nucleic acid. This " kissing " is rate limiting in recombination. Thus, a departure, even by a few kilocalories, in complementarity between bases in the stem might prevent recombination. An evolutionary force adapting base order to favor stem-loop formation could be a very powerful one.

In that the ability to recombine is held to have evolved before the ability to encode proteins, this paper supports the " introns early " model. The positions of exon-intron junctions are held to have been determined by the need to form stem loops and are not necessarily related to protein domains (Gilbert and Glynias 1993). Stoltzfus et al. (1994, p. 202) concluded that " no significant correspondence between exons and units of protein structure was detected ."

GC/AT Fine-Tuning and Speciation

Different genomes, or genome compartments, have characteristic (G+C)/(A+T) ratios (Wyatt 1952 Bernardi 1989 Filipski 1990). In vertebrates this "GC/AT pressure" (fig. 2) is evident in intergenic regions, and introns, and in both coding and noncoding parts of exons. The " genome hypothesis " (Grantham et al. 1980) has led to the proposal that there is some intrinsic genome-wide pressure favoring the adoption of a particular (G+C)/(A+T) ratio in a particular species. The view that this primarily reflects mutational biases (Filipski 1990) has been challenged (Bernardi 1989). I argue elsewhere (unpublished data), that species-specific settings of the (G+C)/(A+T) ratio have arisen as part of a fine-tuning process which prevents recombination between species. Just as different radio transmitters broadcast their messages at different wavelengths to avoid interference, so different genomes " broadcast " their sequences at different (G+C)/(A+T) " wavelengths ," to avoid undesirable recombination events.

The present work suggests a mechanism by which this could occur. Base composition, rather than base order, is the major factor determining the folding energy of a nucleic acid segment (figs. 1B, 3B-4B). Thus small changes in base composition could greatly affect the looping pattern which a sequence could present for homology search. It would be more difficult to extrude loops from GC-rich DNA than from AT-rich DNA.

Two sequences of different G+C percentages undergoing supercoiling in a common intracellular environment might extrude stem loops at different times and to different extents. The pattern of loops presented by homologs would be different, and impaired recombination would result. To recombine, two sequences must be equal both in the local parameter (base order determined stem-loop potential) and in the genomic parameter (base composition determined stem-loop potential).

If avoidance of recombination between different species (including a host species and its pathogen species) is evolutionarily advantageous (Bernstein and Bernstein 1991), then there should have been a selection pressure such that each species in a common environment with other species would have fine-tuned its (C+G)/(A+T) ratio to a level distinct from those of the other species. Thus, biologically similarspecies may have dissimilar ratios, whereas biologically dissimilarspecies (unlikely to interact sexually and with dissimilar pathogens) may have similar ratios (Wyatt 1952). The fine-tuning of base ratios could have been a key component of the postzygotic isolation process leading to speciation (unpublished data [see Evolution Web Page] ).

I thank D. Back for assistance in computer configuration and M. Go, C. Tittiger, V. K. Walker, and G. R. Wyatt for helpful comments on the manuscript. The work was supported by a grant from the Medical Research Council of Canada.

BERNARDI, G. 1989. The isochore organization of the human genome. Annu. Rev. Genet . 23:637-661.

BERNSTEIN, C., and H. BERNSTEIN. 199 1. Aging, sex and DNA repair . Academic Press, San Diego.

BLUM, S., R. E. FORSDYKE, and D. R. FORSDYKE. 1990. Three human homologs of a murine gene encoding an inhibitor of stem cell proliferation. DNA Cell. Biol . 9:589-602.

CRICK, F. 197 1. General model for the chromosomes of higher organisms. Nature 234:25-27.

FILIPSKI, J. 1990. Evolution of DNA sequence: contributions of mutational bias and selection to the origin of chromosomal compartments. Adv. Mutagenesis Res . 2:1-54.

FORSDYKE, D. R. 1995a. Conservation of stem-loop potential in introns of snake venom phospholipase A2 genes: an application of FORS-D analysis. Mol. Biol. Evol . 12:1157-1165.

FORSDYKE, D. R. 1995b. Relative roles of primary sequence and (G+C)% in determining the hierarchy of frequencies of complementary trinucleotide pairs in DNAs of different species. J. Mol. Evol. 41:573-581.

GILBERT, W., and M. GLYNIAS. 1993. On the ancient nature of introns. Gene 135:137-144.

GRANTHAM, R., G. GAUTIER, M. GOUY, R. MERCIER, and A. PARE. 1980. Codon catalog usage and the genome hypothesis. Nucleic Acids Res . 8: r49-r62.

GRIBSKOV, M., and J. DEVEREUX. 1991. Sequence analysis primer . Stockton, New York.

HAWKINS, J. D. 1988. A survey of intron and exon lengths. Nucleic Acids Res. 16:9893-9905.

HAWLEY, R. S., and T. ARBEL. 1993. Yeast genetics and the fall of the classical view of meiosis. Cell 72:301-303.

KLECKNER, N., and B. M. WEINER. 1993. Potential advantages of unstable interactions for pairing of chromosomes in meiotic, somatic and premeiotic cells. Cold Spring Harbor Symp. Quant. Biol . 58:553-565.

LE, S.-Y., and J. V. MAIZEL. 1989. A method for assessing the statistical significance of RNA folding. J. Theor. Biol . 138: 495-510.

MURCHIE, A. 1. H., R. B0WATER, F. ABOUL-ELA, and D. M. J. LILLEY. 1992. Helix opening transitions in supercoiled DNA. Biochem. Biophys. Acta 1131:1-15.

OHTA, T. 1994. Further examples of gene duplication revealed through DNA sequence comparisons. Genetics 138:1331-1337.

ORITA, M., H. IWAHANA, H. KANAZAWA, K. HAYASHI, and T. SEKIYA. 1989. Detection of polymorphisms of human DNA by gel electrophoresis as single strand conformation polymorphisms. Proc. Natl. Acad. Sci. USA 86:2766-2770.

PRABHU, V. V. 1993. Symmetry observations in long nucleotide sequences. Nucleic Acids Res . 21:2797-2800.

RADMAN, M., and R. WAGNER. 1993. Mismatch recognition in chromosomal interactions and specification. Chromosoma 102:369-373.

REED, K. M., L. W. BEUKEBOOM, D. G. EICKBUSH, and J. H. WERREN. 1994. Junction between repetitive DNAs on the PSR chromosome of Nasonia vitripennis.- association of palindromes with recombination. J. Mol. Evol . 38:352-362.

ROMANOVA, L. 1., V. M. BLINOV, E. A. TOLSKAYA, E. G. VIKTOROVA, M. S. KOLESNIKOVA, E. A. GUSEVA, and V. 1. AGOL. 1986. The primary structure of crossover regions of intertypic poliovirus recombinants: a model of recombination between RNA genomes. Virology 155:202-213.

RUSSELL, L., and D. R. FORSDYKE. 1993. The third human homolog of a murine gene encoding an inhibitor of stem cell proliferation is truncated and linked to a CpG island-containing upstream sequence. DNA Cell. Biol . 12:157-175.

RYAN, B. F., and B. L. JOINER. 1994. Minitab handbook . 3d ed. Wadsworth, Belmont, Calif

SOBELL, H. M. 1972. Molecular mechanism for genetic recombination. Proc. Natl. Acad. Sci. USA 69:2483-2487.

STOLTZFUS, A., D. F. SPENCER, M. ZUKER, J. M. LOGSDON, and W. F. DOOLITTLE. 1994. Testing the exon theory of genes: the evidence from protein structure. Science 265: 202-207.

SUNG, P., P. REYNOLDS, L. PRAKASH, and S. PRAKASH. 1993. Purification and characterization of the Saccharomyces cerevisiae RADI/RADIO endonuclease. J. Biol. Chem . 268: 26391-26399.

TOMIZAWA, J. 1984. Control of ColE I plasmid replication: the process of binding of RNA I to the primer transcript. Cell 38:861-870.

TOMIZAWA, J. 1993. Evolution of functional structures of RNA. Pp. 419-445 in R. F. GESTELAND and J. F. ATKINS, eds. The RNA world . Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y.

TURNER, D. H., and P. C. BEVILACQUA. 1993. Thermodynamic considerations for evolution by RNA. Pp. 447-464 in R. F. GESTELAND and J. F. ATKINS. eds. The RNA world . Cold Spring Harbor, Cold Spring Harbor, N.Y.

TURNER, D. H., N. SUGIMOTO, and S. M. FREIER. 1988. RNA structure prediction. Annu. Rev. Biophys. Chem. 17:167192.

WAGNER, R. E., and M. RADMAN. 1975. A mechanism for initiation of genetic recombination. Proc. Natl. Acad. Sci. USA 72:3619-3622.

WANG, J. C., P. R. CARON, and R. A. Kim. 1990. The role of DNA topoisomerase in recombination and genome stability: a double-edged sword? Cell 62:403-406.

WARD, G. K., R. McKENZIE, M. ZANNIS-HADJOPOULOS, and G. B. PRICE. 1990. The dynamic distribution and quantification of DNA cruciforms in eukaryotic nuclei. Exp. Cell Res . 188:235-246.

WYATT, G. R. 1952. The nucleic acids of some insect viruses. J. Gen. Physiol. 36:201-205.

ZUKER, M. 1989. Computer prediction of RNA secondary structure. Meth. Enzym . 180: 262-289.

Shen, L. X., Basilion, J. P. & Stanton, V. P. (1999) Single nucleotide polymorphisms can cause different structural folds of mRNA. Proc. Natl. Acad. Sci. USA 96, 7871-6.

Conservation of Stem-Loop Potential in Introns of Snake Venom Phospholipase A2 Genes. An Application of FORS-D Analysis

D. R. Forsdyke (1995b). Molecular Biology and Evolution12, 1157-1165.
[JAN KLEIN, reviewing editor Received February 16, 1995 Accepted May 19, 1995. With copyright permission from the Editor, Simon Easteal]

Keywords: recombination, introns early theory, intron conservation, exon variation, snake venom, phospholipase A2, FORS-D analysis.

Abbreviations: FONS = folding of natural sequence FORS-M = folding of randomized sequence mean FORS-D = folding of randomized sequence difference indel = insertion and/or deletion PLA2 = phospholipase A2.

    (i) use of synonymous codons,

    (i) The use of those synonymous codons which best facilitate stem-loop formation.

The latter " introns early " theory stated that the early introns would have been regions of high base order-determined stem-loop potential, and predicted that traces of this primitive arrangement should be detectable in some modern genes. To examine this, FORS-D analysis was applied to various gene sequences. A tendency for base order-determined stem-loop potential to localize to introns was apparent, but in several cases the potential was equally apparent in exons and introns (Forsdyke, 1995a). What was needed was a gene that had been under such extreme selection pressure for function that accommodation of the potential in exons would have been unlikely.

The snake venom phospholipase A2 (PLA2) genes appear ideal for this purpose. Snakes are engaged in an intense " arms-race " with their prey (or predators), which can acquire venom resistance (Harding and Welch, 1980). Recently, the genomic sequences of PLA2 genes have become available from habu snakes (Nakashima et al. 1993), and rattlesnakes (John et al. 1994). In keeping with an intense positive selection pressure for change, substitution rates obtained by comparing homologous exons are high, whereas introns sequences are remarkably conserved. In exons, non-synonymous substitutions exceed synonymous substitutions.

The present paper describes the application of FORS-D analysis to these sequences. Contrary to the assertion that for venom PLA2 genes " the regions corresponding to introns of precursor RNAs involved no significant secondary structure " (Nakashima et al. 1993), FORS-D analysis indicates an evolutionary pressure to conserve intron sequences in order to maintain the potential for secondary structure, and hence the potential to undergo recombination.

FORS-D analysis. This evaluates base order-determined stem-loop potential, which closely corresponds to the " statistically significant " stem-loop potential of Le and Maizel (1989). The theory and application of the method were described previously (Forsdyke, 1995a).

Determination of substitution density . Sequences of different PLA2 genes were aligned with the program GENALIGN using the " regions " method, a matching weight of 1.0, and a deletion weight (gap penalty) of -0.5 (Martinez, 1988). The program was accessed through the Bionet on-line computing service (IntelliGenetics Inc., Mountain View, California). Base substitutions and " indels " (insertions or deletions) were counted in successive 200 nt windows, each of which overlapped the preceding window by 150 nt. Substitutions dominated the alignments and changes in deletion weight over a range from -0.25 to -1.0 caused few interchanges between substitutions and indels. Thus, the general patterns of substitution and indel densities were independent of deletion weight. No corrections were made for the possibility of multiple substitutions.

High substitution density in exons. Many species have multiple PLA2 genes, which usually consist of four exons (Davidson and Dennis, 1990). In snakes a subset of these genes encodes proteins which contribute to the toxicity of venom. Intraspecies comparisons of base substitutions in snake venom phospholipase genes show high substitution densities in exons, and low substitution densities in introns . This applies to the habu snake (Nakashima et al, 1993), and to the rattlesnake (John et al. 1994).

FIG. 1.-Distribution of (A) substitutions and (B) indels in rattlesnake PLA2 acidic subunit gene U01026, relative to rattlesnake PLA2 basic subunit gene U01027 (circles), habu snake PLA2 gene TFLPA2P1 (squares), and habu snake PLA2 gene TFLPA2P2 (triangles). Substitutions and indels are counted in successive 200 nt windows, each of which overlaps the preceding window by 150 nt. Data points at 50 nt intervals each correspond to the centre of a window. The positions of the four exons are shown as boxes in B. Vertical dashed lines indicate exon borders, the beginning of the protein-coding part of exon 1, and the end of the protein-coding part of exon 4.

Figure 1a demonstrates this, both within species, and between species. Substitutions are scored for successive 200 nt windows which overlap by 150 nt. Relative to the rattlesnake PLA2 acidic subunit gene (GenBank designation U01026), the rattlesnake PLA2 basic subunit gene (GenBank designation U01027), and two habu snake PLA2 genes (GenBank designations TFLPA2P1 and TFLPA2P2), show many substitutions in all but the first exon, most of which does not code for protein. In the fourth exon, substitution density is greatest in the protein-encoding region. Because some of the 200 nt windows, centred in introns, overlap exons, sharp demarkations between the low substitution density introns and high substitution density exons, are not apparent. However, the relatively low intron substitution density is not uniform in the second intron here the density increases in the 3' half of the intron. Essentially similar density patterns are obtained when different venom PLA2 genes are taken as a basis for comparison (e.g. TFLPA2P2 compared with U01026, U01027 and TFLPA2P1 data not shown). Thus, although the particular bases substituted tend to be characteristic of a particular pair of genes, the overall substitution patterns are unlikely to be random.

Figure 1b shows similar data for base insertions and deletions (" indels "). Indels are in both coding and non-coding regions. Major peaks occur at the 3' end of the second intron in the cases of the U01026/U1027 and U1026/TFLPA2P2 comparisons. In the former case the peak is associated with a 38 nt CG-rich sequence in U01026, which is also present in the habu snake genes, but not in U01027. In the latter case the peak is associated with a microsatellite repeat (CA)n in TFLPA2P2, which is not so evident in the other snake venom PLA2 genes. The indel density patterns shown in figure 1b are essentially similar to those obtained when different venom PLA2 genes are taken as a basis for comparison (data not shown). Thus, although the actual bases inserted or deleted tend to be characteristic of a particular pair of genes, the overall patterns are unlikely to be random.

Reciprocal relationship between substitution density and FORS-D. The 200 nt windows used to characterize substitution and indel densities (fig. 1), were subjected to the program FOLD (Zuker, 1989) to determine the energetically most favourable folded structures.

FIG. 2.-Comparison of distribution of values for FONS (closed circles), FORS-M (open triangles) and FORS-D (closed triangles), with distribution of base substitutions (open circles) and indels (open squares), for the rattlesnake PLA2 acidic subunit gene U01026 (A, B). Values are for the same 200 nt windows as shown in figure 1. Substitutions and indels are relative to the rattlesnake PLA2 basic subunit gene U01027. Boxes in B indicate the location of the four exons, with four fine vertical lines showing the beginning of exon 1, the beginning of the protein-coding part of exon 1, the end of the protein-coding part of exon 4, and the end of exon 4. On the right (C,D) are shown FONS, FORS-M and FORS-D values for the cDNA. Vertical lines indicate splice sites (exon-intron junctions).
FIG. 3.-Comparison of distribution of values for FONS, FORS-M and FORS-D, with distribution of base substitutions and indels, for the rattlesnake PLA2 basic subunit gene U01027. Substitutions and indels are relative to the rattlesnake PLA2 acidic subunit gene U01026. Symbols and other details are as in figure 2.

In figures 2b (rattlesnake gene U01026) and 3b (rattlesnake gene U01027), the minimum free energies associated with each structure (FONS values) are shown together with the mean values for ten randomized versions of the same sequence windows (FORS-M values). In keeping with previous observations on a variety of genes (Forsdyke, 1995a), FONS values tend to be more negative than FORS-M values (indicating greater stability of the secondary structure of the natural sequence). This is particularly apparent in the 5' non-coding part of exon 1, in the first part of intron 2, and in intron 3. The differences (FORS-M less FONS) are expressed in figures 2a and 3a as the FORS-D (" folding of randomized sequence difference ") values. Positive FORS-D values reflect a contribution of base order (primary sequence), rather than of base composition, to the stability of the optimum structure . The previously noted tendency for protein-encoding parts of exons to have zero or negative FORS-D values, and for introns to have positive FORS-D values (Forsdyke, 1995a), is generally apparent. In the case of U01026 the first intron has low values, but this is not seen in the case of U01027. In the case of U01027 the coding part of the fourth exon corresponds to a relative decline in FORS-D value, but remains positive. It should be noted that high FORS-D values, reflecting local adaptation of base order for the generation of secondary structure, are not necessarily correlated with high FONS values (e.g. high FORS-D values occur in the 3' half of the 2nd intron and in the 3' non-coding part of the fourth exon).

Figures 2c,d and 3c,d show FONS, FORS-M and FORS-D values for U01026 and U01027 (rattlesnake) cDNAs. Fine vertical lines indicate sites of intron removal. FORS-D values are low in the regions corresponding to the middle two exons, but increase in the 5' and 3' exons, which contain extensive non-coding regions.

Figures 2a and 3a also show substitution and indel densities taken from figure 1. There is an approximately reciprocal relationship between FORS-D values and substitution density. Where FORS-D values are low (coding parts of exons), substitution densities are high. Where FORS-D values are high (introns 5' and 3' non-coding regions), substitution densities are low. However, the 3' end of the 2nd intron, where indel density is highest, is characterized by relatively high values both for FORS-D and substitution density. Similar results were obtained when the other venom PLA2 genes under study were compared in this manner.

More objective evidence for a relationship between secondary structure and substitution density was provided by plotting individual window values for FONS, FORS-M and FORS-D, against the corresponding substitution densities. For each pair of genes three linear regression plots were obtained.

FIG. 4.-Linear regression analysis of the relationship between FONS, FORS-M and FORS-D values (kcal/mol), and substitution density (base exchanges/200 nt window) in snake venom PLA2 genes. For the left column of figures (A,B,C), data are from figure 2 (FONS, FORS-M and FORS-D values for rattlesnake PLA2 acidic subunit gene U01026, and substitutions relative to rattlesnake PLA2 basic subunit gene U01027). The middle column (D,E,F) shows FONS, FORS-M and FORS-D values for rattlesnake PLA2 acid subunit gene U01026, and substitutions relative to habu snake PLA2 gene TFLPA2P2. For the right column (G,H,I) data are from figure 3 (FONS, FORS-M and FORS-D values for rattlesnake PLA2 basic subunit gene U01027, and substitutions relative to rattlesnake PLA2 acidic subunit gene U01026). Parameters of the least-squares line shown in each figure are, slope (s), the correlation coefficient (r), and the probability that the slope of the line is not significantly different from zero (P).

Figure 4 shows results for three gene pairs (U01026/U01027, U01027/U01026, U01026/TFLPA2P2). In general, plots for FORS-M are horizontal, plots for FONS slope down slightly and plots for FORS-D slope down more steeply. Within each figure are listed values for the slope (s), correlation coefficient (r), and the probability (P) that the slope is not significantly different from zero. The latter two values are most significant (e.g. high values for r and low values for P) in the case of plots of FORS-D against substitution density, and least significant in the case of plots of FORS-M against substitution density. The results of similar plots for other gene combinations are shown in Table 1. Although values for some gene combinations have relatively low r values and high P values (e.g. figs. 4g,h,i), direct inspection of the corresponding plots of values as a function of sequence position (e.g. fig. 3a) tends to support the generalization that, in most parts of a gene, there is a reciprocal relationship between substitution density and FORS-D value .

Although unusual, high conservation of non-coding sequences in cDNAs, to an extent exceeding that of neighbouring coding sequences, has been observed both within species (Blum et al. 1990), and between species (Duret et al. 1993). Ogawa and coworkers (1992) compared PLA2 cDNAs of the habu snake (Trimeresurus flavoviridis), which were found to have extremely high conservation of 5' and 3' non-coding sequences (98% and 89% identities, respectively). They correlated this with high stem-loop potential, and suggested an important functional role at the translation level. They also noted high conservation of the nucleotide sequence corresponding to the 16 amino acid signal peptide. However, the remaining protein-encoding sequence showed low conservation (67% identity), mainly due to nucleotide substitutions. An unusually high proportion of these were non-synonymous, thus changing the protein sequence. These observations were consistent with strong selection pressures both to conserve the signal peptide function, and to change the function of the mature protein (122 amino acids). The change would be required to counter adaptations to resistance of the prey. The signal peptide would, of course, be discarded and would not interact with the prey. PLA2 generates a cytotoxic product (Takeda et al. 1982), and reacts with broad specificity binding proteins (Lambeau et al. 1994), which might adapt to generate resistance.

    (i) serine proteinase inhibitor genes , the products of which may need to counter adaptations to resistance of parasite proteases (Hill and Hastie, 1987),

When habu snake PLA2 genomic sequences became available, Nakashima and coworkers (1993) found that intron sequences were as highly conserved as the 5' and 3' non-coding regions . This made explanations of functional effects at the translation level unlikely, and raised the possibility of effects of nucleic acid secondary structure at the genomic level, or at the level of RNA processing. However, Nakashima and coworkers found no secondary structure when the sequences were analyzed by the method which had been used to show secondary structure in 5' and 3' non-coding regions (Zuker and Steigler, 1981 Ogawa et al. 1992).

The observations on habu snake venom PLA2 genes were also found to apply to the venom PLA2 genes of the Mojave rattlesnake (Crotalus scutulatus scutulatus). John and coworkers (1994) pointed out that " the greater intron identity appears to violate evolutionary dogma, which predicts greater divergence in the non-coding regions ". They suggested a critical role of introns in pre-mRNA stabilization or processing.

The present work began by demonstrating similar patterns of conservation, whether comparisons were made within species or between species (fig. 1). The patterns were similar for different gene pairs, but the actual changes tended to be characteristic of a particular gene pair. Thus, although gene duplication generating multiple venom PLA2 genes may have occurred in an ancestor prior to the divergence of the habu snake and rattlesnake lineages (Davidson and Dennis, 1990), most sequence differentiation is likely to have occurred within species, after the divergence. Similarity of patterns of different species indicates similar selective constraints.

Direct observations of the folding into stem-loop structures of sections of the sequences (FONS values) indicated some increased secondary structure potential in introns, relative to exons (figs. 2b, 3b). This was particularly apparent in the third intron. The failure of others to note this may reflect the use in the present work of an improved folding algorithm (Zuker, 1989), and better energy values for base stacking and loop destabilization (Turner et al. 1988).

However, FONS values reflect the contribution of both base order and base composition. Base composition probably serves genome-wide or genome-sector specific functions, rather than gene-specific functions, as discussed elsewhere (Forsdyke, 1995a-c). Evolutionary effects on the base order became apparent when the negative FONS values (greater negativity reflecting more stable structures) were subtracted from the negative FORS-M values (which reflect the contribution of base-composition alone). This produced FORS-D values. Positive FORS-D values, indicating the acceptance of mutations which contribute to the stability of secondary structure by changing base order, were generally present both in introns and in 5' and 3' non-coding regions. Thus, the structures in these regions are likely to be functionally relevant , although they may not necessarily be correlated with high FONS values. Zero and negative FORS-D values, reflecting no acceptance of such mutations, were generally present in the exons encoding the mature PLA2 protein (figs. 2a, 3a). Thus, the conservation of intron sequences could reflect a functional constraint for which base order-determined stem-loop potential is required. The direct relationship between base order-determined stem-loop potential and sequence conservation (i.e. inverse relationship to substitution density), was shown by linear regression analysis (fig. 4 table 1). Similar observations have been made on retroviral (HIV-1) sequences (Le et al. 1988, 1989 Forsdyke, 1995d).

Indels probably play an important role in interfering with the initial alignment (register) of stem-loops in potentially recombining homologous strands (Tomizawa, 1993). Although, within a genome, the introns for different venom PLA2 genes may show great conservation (tending to provoke recombination), indels would make it more likely that recombination would occur only between identical genes. The differences in exon sequences would also ensure that, even if the initial " kissing " interaction between the loops of stem-loops were stable (Kleckner and Weiner, 1993 Tomizawa, 1993), subsequent consummation of the union would be unlikely (Rayssiguier et al. 1989 Radman and Wagner, 1993).

The results presented here are consistent with the new " introns-early " hypothesis (Forsdyke, 1995a). It is becoming increasingly evident that introns interrupt nucleic acid information in general, not just protein-encoding information . Thus, introns are often found in 5' and 3' non-coding regions (Hawkins, 1988). Genes encoding "mRNAs" which have no protein product have been identified the gene sequences are interrupted by introns just like normal protein-encoding genes (Brannan et al. 1990 Brockdorff et al. 1992). The latter observation, and those presented in this paper, reinforce the viewpoint that there is generally no significant association between intron locations and features of protein structure (Stoltzfus et al. 1994).

The results presented here also suggest that it would be of considerable interest to extend the analysis to other genes under positive selection pressure (Hill and Hastie, 1987 Hughes and Nei, 1989), as has been done in the case of retroviral genomes (Forsdyke, 1995d). Indeed, preliminary studies show that regions encoding peptide-binding sites in MHC genes have negative FORS-D values [Forsdyke 1996b unpublished in 1995]. It has been noted that troponin-C genes are under positive Darwinian selection (Ohta, 1994), and that certain exons show very marked negative FORS-D values (Forsdyke, 1995a,c[ =1996a ]). It would also be of interest to compare venom PLA2 genes, with other PLA2 genes which would not be under such pressure.

Acknowledgements. This work was supported by a grant from the Medical Research Council of Canada.

BERNSTEIN, C., and H. BERNSTEIN. 1991. Aging, Sex and DNA Repair . Academic Press, San Diego.

BLUM, S., R. E. FORSDYKE, and D. R. FORSDYKE. 1990. Three human homologs of a murine gene encoding an inhibitor of stem cell proliferation. DNA Cell Biol . 9:589-602.

BRANNAN, C. I., E. C. DEES, D. S. INGRAM, and S. M. TILGHMAN. 1990. The product of the H19 gene may function as an RNA. Mol. Cell. Biol . 10:28-36.

BROCKDORFF, N., A. ASHWORTH, G. F. KAY, V. M. McCABE, D. P. NORRIS, P. J. COOPER, S. SWIFT, and S. RASTAN. 1992. The product of the mouse Xist gene is a 15 kb inactive X-specific transcript containing no conserved ORF and located in the nucleus. Cell 71:515-526.

DAVIDSON, F. F., and E. A. DENNIS. 1990. Evolutionary relationships and implications for the regulation of phospholipase A2 from snake venom to human secreted forms. J. Mol. Evol. 31:228-238.

DURET, L., F. DORKELD, and C. GAUTIER. 1993. Strong conservation of non- coding sequences during vertebrate evolution: potential involvement in post- transcriptional regulation of gene expression. Nucleic Acids Res . 21:2315-2322.

FORSDYKE, D. R. 1991. Early evolution of MHC polymorphism. J. Theor. Biol . 150:451-456.

FORSDYKE, D. R. 1994. Relationship of X chromosome dosage compensation to intracellular self/not-self discrimination: a resolution of Muller's paradox? J. Theor. Biol . 167:7-12.

FORSDYKE, D. R. 1995a. A stem-loop "kissing" model for the initiation of recombination and the origin of introns. Mol. Biol. Evol . 12:949-958.

FORSDYKE, D. R. 1995b. Relative roles of primary sequence and (G + C)% in determining the hierarchy of frequencies of complementary trinucleotide pairs in DNAs of different species. J. Mol. Evol . 41:573-581.

FORSDYKE, D. R. 1995c. Different biological species "broadcast" their DNAs at different (G+C)% "wavelengths". (submitted) --> J. Theor. Biol . (1996a) 178:405-417.

FORSDYKE, D. R. 1995d. Reciprocal relationship between stem-loop potential and substitution density in retroviral quasispecies under positive Darwinian selection.(submitted) --> J. Mol. Evol .(1995) 41:1022-1037.

HARDING, K. A., and K. R. G. WELCH. 1980. Venomous Snakes of the World . Pergamon Press, Oxford.

HAWKINS, J. D. 1988. A survey of intron and exon lengths. Nucleic Acids Res . 16:9893-9905.

HILL, R. E., and N. D. HASTIE. 1987. Accelerated evolution in the reactive centre regions of serine protease inhibitors. Nature 326:96-99.

HUGHES, A. L., and M. NEI. 1989. Nucleotide substitution at major histocompatability complex class II loci: evidence for overdominant selection. Proc. Natl. Acad. Sci. USA . 86:958-962.

JOHN, T. R., L. A. SMITH, and I. I. KAISER. 1994. Genomic sequences encoding the acidic and basic subunits of Mojave toxin: unusually high sequence identity of non-coding regions. Gene 139:229-234.

JOYCE, G. F., and L. E. ORGEL. 1993. Prospects for understanding the origin of the RNA world. Pp 1-25 in R. F. GESTELAND, and J. F. ATKINS, eds. The RNA World . Cold Spring Harbour Laboratory Press, New York.

KLECKNER, N., R. PADMORE, and D. K. BISHOP. 1991. Meiotic chromosome metabolism: one view. Cold Spring Harbour Symp. Quant. Biol . 56:729-743.

KLECKNER, N. and B. M. WEINER. 1993. Potential advantages of unstable interactions for pairing of chromosomes in meiotic, somatic and premeiotic cells. Cold Spring Harbour Symp. Quant. Biol . 58:553-565.

LAMBEAU. G., P. ANCIAN, J. BARHANIN, and M. LAZDUNSKI. 1994. Cloning and expression of a membrane receptor for secretory phospholipase A2. J. Biol. Chem. 269:1575-1578.

LE, S-Y., J-H. CHEN, M. BRAUN, M. GONDA, and J. V. MAIZEL.1988. Stability of RNA stem-loop structure and distribution of non-random structure in the human immunodeficiency virus (HIV-1). Nucleic Acids Res . 16:5153-5168.

LE, S-Y., J-H. CHEN, D. CHATTERJEE, and J. V. MAIZEL. 1989. Sequence divergence and open regions of RNA secondary structures in the envelope regions of the 17 human immunodeficiency virus isolates. Nucleic Acids Res . 17:3275-3288.

LE, S-Y. and J. V. MAIZEL. 1989. A method for assessing the statistical significance of RNA folding. J. Theor. Biol. 138:495-510.

MARTINEZ, H. M. 1988. A flexible multiple sequence alignment program. Nucleic Acids Res. 16:1683-1691.

NAKASHIMA, K-I., T. OGAWA, N. ODA, M. HATTORI, Y. SAKAKI, H. KIHARA, and M. OHNO. 1993. Accelerated evolution of Trimeresurus flavoviridis venom gland phospholipase A2 isozymes. Proc. Natl. Acad. Sci. USA 90:5964-5968.

OGAWA, T., N. ODA, K-I NAKASHIMA, H. SASAKI, M. HATTORI, Y. SAKAKI, H. KIHARA, and M. OHNO. 1992. Unusually high conservation of untranslated sequences in cDNAs for Trimeresurus flavoviridis phospholipase A2 isoenzymes. Proc. Natl. Acad. Sci. USA 89:8557-8561.

OHTA, T. 1994. Further examples of evolution by gene duplication revealed through DNA sequence comparisons. Genetics 138:1331-1337.

RADMAN, M., and R. WAGNER. 1993. Mismatch recognition in chromosomal interactions and speciation. Chromosoma 102:369-373.

RAYSSIGUIER, C., D. THALER, and M. RADMAN. 1989. The barrier to recombination between Echerichia coli and Salmonella typhimurium is disrupted in mismatch-repair mutants. Nature 342:396-401.

ROMANOVA, L. I., V. M. BLINOV, E. A. TOLSKAYA, E. G. VIKTOROVA, M. S. KOLESNIKOVA, E. A. GUSEVA, and V. I. AGOL. 1986. The primary structure of crossover regions of intertypic poliovirus recombinants: a model of recombination between RNA genomes. Virology 155:202-213.

SCPAER, E. G., and J. I. MULLINS. 1993. Rates of amino acid change in the envelope protein correlates with pathogenicity of primate lentiviruses. J Mol Evol 37:57-65

STOLTZFUS, A., D. F. SPENCER, M. ZUKER, J. M. LOGSDON, and W. F. DOOLOTTLE. 1994. Testing the exon theory of genes: the evidence from protein structure. Science 265:202-207.

TAKEDA, A., R. G. E. PALFREE, and D. R. FORSDYKE. 1982. Role of serum in inhibition of cultured lymphocytes by lysophosphatidylcholine. Biochem. Biophys. Acta 710:87-98.

TOMIZAWA, J. 1993. Evolution of functional structures of RNA. Pp. 419-445 in R. F. GESTELAND, and J. F. ATKINS, eds. The RNA World .Cold Spring Harbour Laboratory Press, New York.

TURNER, D. H., N. SUGIMOTO, and S. M. FREIER. 1988. RNA structure prediction. Annu. Rev. Biophys. Chem . 17:167-192.

ZUKER, M. 1989. Computer prediction of RNA secondary structure. Meth. Enzym . 180:262-289.

ZUKER, M., and P. STEIGLER. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9:133-148.

Stem-loop potential in MHC genes:
a new way of evaluating positive Darwinian selection?

D. R. Forsdyke (1996b) Immunogenetics 43: 182-189.
[ With copyright permission from Springer Inc .]

Abstract The domains of polymorphic MHC proteins which interact with peptides and T cell receptors are considered to have been under positive evolutionary selection pressure. Evidence for this is a high ratio of non-synonymous to synonymous mutations in the corresponding genomic domains. By this criterion snake venom phospholipase A2 genes have also been under positive selection pressure. Recent studies of the latter genes indicate that positive selection has overridden an evolutionary pressure on base order which normally promotes the potential to extrude single strand stem-loops from supercoiled duplex DNA (" fold pressure "). This has resulted in base order-dependent stem-loop potential being shifted to introns, which are highly conserved between species. It is now shown that, like snake venom phospholipase A2 genes, the domains of polymorphic MHC genes which appear to have responded to positive selection pressure have decreased base order-dependent stem-loop potential. The evolutionary pressure to generate stem-loop potential (believed to be important for recombination) has been overridden less in exons under negative " purifying " selection, than in exons under positive " Darwinian " selection. Thus, base order-dependent stem-loop potential shows promise as an independent indicator of positive selection.

Class I and class II major histocompatibility complex (MHC) proteins bind intracellular peptides derived from intracellular and imported extracellular proteins and present them at the cell surface to T cell receptors (Bjorkman and Parham, 1990). Intracellular and extracellular pathogens which mutate their proteins to evade this host immune defence, present an evolutionary " moving target " to which the host species must respond by corresponding mutations in the peptide-recognition domains of MHC proteins (Klein and O'hUigin, 1994). In these regions advantageous mutations are positively selected (" Darwinian selection "), whereas in other regions the sequence is conserved, most mutations being disadvantageous and hence being negatively selected (" purifying selection "). Evaluating which gene and protein domains of a pathogen and its host are under positive or negative evolutionary selection is an important step leading to the identification of possible molecular sites for therapeutic attack.

A widely accepted method of determining if a region of a protein has been under positive selection pressure is to calculate for the corresponding nucleic acid the ratio of non-synonymous mutations (which change the amino acid sequence) to synonymous mutations (which do not change the amino acid sequence). A high proportion of non-synonymous mutations indicates that a region has been under positive selection. This has been shown to apply to the peptide-binding regions of both class I and class II MHC proteins (Hughes and Nei, 1988,1989). The method has also been used to provide evidence for positive selection in the cases of the active sites of protease inhibitors believed to be under the selection pressure of pathogen-derived proteases (Hill and Hastie, 1987), the variable regions of immunoglobulins under selection pressure of extracellular pathogens (Tanaka and Nei, 1989), the surface-exposed loops of pathogen proteins under the selection pressure of host immune defences (Smith et al. 1995), and snake venom phospholipase A2 molecules under the selection pressure of resistant prey/predators. In the latter case the entire secreted protein appears to have been the subject of positive selection (Nakashima et al. 1993).

In such studies, " since synonymous changes do not change the amino acid, synonymous mutations are expected to be neutral, or nearly so " (Hughes and Hughes, 1995). However, there is growing evidence that many " neutral " mutations reflect the outcome of two major evolutionary pressures, " GC/AT pressure ", and " fold pressure " these together generate stem-loop potential (Forsdyke, 1995a-c 1996a,b). The former is assessed as the base composition-determined component ("FORS-M" value), and the latter as the base order-determined component ("FORS-D" value), of the total stem-loop potential of a nucleic acid segment ("FONS" value). The sequence of an exon appears to reflect an evolutionary compromise between these two pressures and the pressure to encode an optimum amino acid sequence (" protein pressure "). In the case of genes under intense positive selection, protein pressure would be expected to dominate, with adverse consequences for the other pressures. This would apply particularly to fold pressure which, in contrast to GC/AT pressure, is a local evolutionary " strategy ", rather than a genome or genome-sector strategy (Forsdyke, 1995a). Recent studies of the distribution of stem-loop potential in snake venom phospholipase A2 genes have shown that fold pressure has been strongly subverted in regions under positive Darwinian selection, suggesting a new approach to the evaluation of positive selection (Forsdyke, 1995b). This paper explores the applicability of the approach to MHC genes.

The theoretical basis of the evaluation of stem-loop potentials in nucleic acid sequences (" FORS-D analysis ") is described elsewhere (Le and Maizel, 1989 Forsdyke, 1995a-c). The Genetics Computer Group implementation of the program RNAFOLD , was accessed as part of the Molecular Biology Data Service of the National Research Council, Ottawa (Zuker, 1989). The program was designed for the study of secondary structure in non-supercoiled single-stranded RNA molecules. It is assumed here that it is applicable to DNA. This assumption is not entirely valid (Breslauer et al., 1986). For example, the 2' hydroxyl group in RNA can contribute of the order of 1 kcal/mol to helix stability. Since the present study was mainly concerned with differences in fold energies, rather than absolute values, differences between RNA and DNA are considered of minor importance. Following the completion of the study, Nielsen and coworkers (1995) made available local data files for folding DNA for use with the program LRNA, which is an advanced version of RNAFOLD (Jaeger et al., 1990). Comparison of the outputs of different programs confirms the assumption that the distribution of fold values along genes shows the same general pattern with different programs when using local data files either for RNA or for DNA (Forsdyke, unpublished work).

RNAFOLD is applied to natural and randomized versions of consecutive overlapping " windows " in nucleic acid sequences to generate, respectively, " folding of natural sequence " (FONS) values, and " folding of randomized sequence " (FORS) values. The FONS value for a window is subtracted from the corresponding mean of several FORS values (FORS-M), to generate the " folding of randomized sequence difference " (FORS-D) value for the window.

The theory of stem-loop potentials (Forsdyke, 1995a-c) predicts that, in local regions of the genome, FORS-M values (the base composition-dependent component of the FONS value) should be constant. However, cyclical fluctuations with a period around 0.6 kb are seen in most genes. While this may reflect some fundamental aspect of chromatin organization, it is possible that low negative FORS-M values, usually corresponding to AT-rich regions, are the first to extrude from negatively supercoiled duplex DNA (Sinden, 1994). These AT-rich regions might be close to the centres of " minimum efficient processing segments " involved in recombination (Radman et al. 1993).

The theory also implies that, whereas zero FORS-D values indicate failure to accommodate to evolutionary pressure for stem formation, negative FORS-D values indicate that a sequence has been modified to oppose stem formation (i.e. promote duplex stability Forsdyke, unpublished work [see Forsdyke 1998] ). This might work at the DNA level to prevent recombination. Alternatively, it might work at the mRNA level to prevent translational delay.

Figure 1 shows an example of FORS-D analysis as applied to an allele of a highly polymorphic human MHC class I gene, the products of which would be expressed on the surface of most nucleated somatic cells.

The lower part of the figure shows the distribution of the FONS and FORS-M values from which the FORS-D values were calculated by subtraction (FORS-M - FONS). As found in most other genes studied (Forsdyke, 1995a), positive FORS-D values predominate, especially in introns and flanking regions. Extremely negative FORS-D values are associated with the first four exons (shown as numbered boxes). The first exon encodes the signal peptide. The second and third exons encode the highly polymorphic a1 and a2 domains of the protein. When inter- and intra-species comparisons are made with other class I genes, these two exons show unusually high proportions of non-synonymous base substitution mutations in regions corresponding to the variable amino acid residues which interact with peptides and T cell receptors (shown as black boxes within the exon boxes Hughes and Nei, 1988 Bjorkman and Parham, 1990).

The fourth exon encodes the non-polymorphic a3 domain, which associates non-covalently with b2-microglobulin. The negative FORS-D region here is not so broadly spread out as in the second and third exons, and includes the sequence encoding the conserved CD8 recognition domain (Salter et al. 1990). Decreased FORS-D values in exons indicate insufficient evolutionary flexibility in usage of synonymous codons and in the exchange of amino acids with similar functions to permit base order to optimize simultaneously both coding potential and stem-loop potential (Forsdyke, 1995a). Decreased FORS-D values are likely to have resulted from positive selection in the case of the second and third exons, and negative (" purifying ") selection in the case of the conserved fourth exon.

Fig. 2. FORS-D analysis of the second exon of the human MHC class I gene shown in Figure 1, using successive 200 nt windows overlapping by 195 nt. Green boxes indicate correspondence with amino acid residues which contact peptide, or the T cell receptor (asterisks Bjorkman and Parham, 1990). Vertical dashed lines indicate exon-intron boundaries.

The values shown in Figure 1 are for consecutive 200 nt windows each of which overlapped the preceding window by 175 nucleotides. Thus a window is displaced by 25 nt from the preceding window. Figure 2 shows data for the 2nd exon alone using windows overlapping by 195 nt. Thus a window is displaced by only 5 nt from the preceding window. The results at this higher level of resolution confirm the tendency for low FORS-D values to associate with regions involved in the binding of peptide and/or T cell receptor (i.e. regions under positive selection pressure).

Fig. 3. FORS-D analysis of an allele of a highly polymorphic mouse MHC class II b chain gene (H2-Ab GenBank MUSMHABZ2). Exons 2 and 3 correspond to extracellular protein domains. Exon 2 is the polymorphic domain which interacts with peptide and the T cell receptor. Exon 4 corresponds to the transmembrane protein domain. Exon 5 corresponds to an intracellular protein domain. Other details are as in Figure 1.
Fig. 4. FORS-D analysis of exon 2 of the mouse MHC class II b chain gene shown in Figure 3. Black boxes indicate regions encoding amino acids which interact with peptide and/or the T cell receptor. The striped box indicates the region encoding amino acids involved in dimer interaction .

Figures 3 and 4 show data similar to those of Figures 1 and 2, for part of a polymorphic mouse MHC class II Ab chain-encoding gene (Hughes and Nei, 1989). In this case only the second exon encodes the variable sequences which interact with peptide and T cell receptor, and this exon shows much lower FORS-D values than the third exon. The association of decreased FORS-D values with the second exon is evident both at " low " resolution (Fig. 3 windows moving in steps of 25 nt), and at " high " resolution (Fig. 4 windows moving in steps of 5 nt). Again, there is a tendency for decreased FORS-D values to associate with regions involved in the binding of peptide and/or T cell receptor.

Table 1 summarizes similar studies with a variety of MHC genes. Average FORS-D values for windows in peptide binding exons are compared with average values for windows in non-peptide-binding exons (the fourth exon in the case of class I genes and the third exon in the case of class II genes). With the exception of the 2nd exon of the pseudogene HLA-H (see later), each peptide-binding exon has a lower average FORS-D value than the corresponding non-peptide-binding exon. Polymorphic exons are considered to have been under more intense positive selection than monomorphic exons, some of which include pseudogenes (Hughes, 1995). Polymorphic exons usually have lower average FORS-D values than monomorphic exons, especially in the case of class II genes. It should be noted that, although sometimes considered monomorphic, the class I gene HLA-E does show some polymorphism (Ohya et al. 1990).

Pseudogenes are presumed to have escaped the selection pressure of pathogens (Hughes, 1995), so that potential peptide-binding exons would be expected to have higher average FORS-D values than peptide-binding exons of genes still under that pressure. This is found in the case of the class I gene HLA-H (also known as HLA-AR and HLA-54). Exon 2 has an average FORS-D value of 2.49 kcal/mol, contrasting dramatically with the corresponding value for the highly polymorphic HLA-A1 gene (-3.04 kcal/mol Table 1). However, another pseudogene (HLA-J, HLA-59) has an average FORS-D value in exon 2 of -3.54. This is largely due to extreme negative values in the domains encoding the b-sheet contact residues of the peptide-binding cleft the domains encoding contact residues associated with a-helices had higher FORS-D values (data not shown). Exon 3 of HLA-J had a relatively high average FORS-D value (-0.62 kcal/mol), consistent with designation as a monomorphic pseudogene.

The average FORS-D value for 200 nt windows in long human genomic segments containing much intergenic DNA is about 4 kcal/mol (Forsdyke, 1995a). Similar values are found in various other species, including those with compact genomes and little intergenic DNA (Forsdyke, 1995c). Thus, positive base order-dependent stem-loop potential, assessed as the FORS-D value, is a widespread and fundamental characteristic of genomes . It is proposed that protein-encoding regions have been under evolutionary pressure to accommodate to the potential by:

(i) the use of synonymous codons ,

(ii) switching between amino acids with similar properties to widen the range of codon choice , and

(iii) decreasing the size of the protein-encoding unit by interspersing introns (Forsdyke, 1995a).

The existence of exons with zero and negative base-order dependent stem-loop potentials indicates impairment of accommodation.

Among the series of genes initially studied, a human troponin-c gene was found to have the most significant association of negative FORS-D values with exons (Forsdyke, 1995a 1996a). Unlike the other genes in that study, the troponic C gene is believed is believe to have been under positive Darwinian selection pressure following gene duplication (Ohta, 1994). Based on the proportions of non-synonymous and synonymous mutations, snake venom phospholipase A2 genes are also believed to have been under positive selection pressure (Nakashima et al., 1993), and also have a significant association of negative FORS-D values with exons (Forsdyke, 1995b). In these two cases entire exons seem to have been involved in positive selection. Is the association of zero or negative FORS-D values with regions under positive selection a general phenomenon? Furthermore, is FORS-D analysis sufficiently sensitive in the case of exons only parts of which have been under positive selection pressure?

The present work shows that when positive selection has affected local sectors of exons, these sectors tend to have low FORS-D values (Figs. 1-4). This suggests that the relationship between decreased base order-dependent stem-loop potential and positive selection is a general one, and that the method is sufficiently sensitive. Studies with retroviral genes, which may also have been under positive selection pressure (Scpaer and Mullins, 1993), further support the generalization (Forsdyke, 1995d).

Regions under negative selection pressure may show decreased base-order stem-loop potential (Forsdyke, 1995a). However, the decrease seems more profound, and more sustained, in sequences which have been under positive, rather than under negative, selection pressure. This was suggested by comparing the troponin C gene with other genes (Forsdyke, 1995a 1996a), and is shown here by comparing, within a gene, exons encoding extracellular protein domains under positive selection and exons encoding extracellular protein domains under negative selection (the fourth exon in the case of class I MHC genes, and the third exon in the case of class II MHC genes Table 1). In general, the results with monomorphic exons, some of which are in pseudogenes (Table 1), are also consistent with decreased selection pressure relative to polymorphic exons (i.e. higher average FORS-D values in monomorphic exons, which have been under less selection pressure Hughes, 1995).

How accurately the present method can distinguish between regions showing high and low proportions of non-synonymous mutations is not known. Preliminary studies with decreased window size indicate that this approach is not likely to increase accuracy. The FORS-D values plotted in Figures 1-4 correspond to the middle of 200 nt windows. Remarkably, although an entire window is involved in the calculation, and the folding may not be symmetrical, the middle of the window seems to correlate best with regions under positive selection.

It is proposed that positive stem-loop potentials have evolved under a selection pressure to improve recombination (Forsdyke, 1995a). The diversification of exons under positive selection pressure, would shift to introns the burden of developing stem-loop potential, so that they would then become highly conserved. Thus, introns are conserved in snake venom phospholipase A2 genes (Nakashima, 1993 Forsdyke, 1995b), in troponin C genes (Schreier et al. 1990), and, less dramatically, in MHC genes (Milner-White, 1984). This suggests that intron conservation may provide another index of positive selection.

The evolution of polymorphic MHC proteins is of much interest (Forsdyke, 1991, 1994, 1995d Klein and O'hUigin, 1994). While this paper is concerned with positive selection of MHC proteins as part of evolutionary arms races between pathogens and their prey, positive selection may also have been important in the development of a new MHC genes following gene duplications (Goodman, 1975). To become fixed, a new gene would be under pressure to mutate:

(i) to change the pattern of stem-loops and thus impede recombination with its partner (synonymous, so-called " neutral " mutations Forsdyke, 1996b), and

(ii) to acquire a new function for which the new sequence would be positively selected (non-synonymous mutations affecting the phenotype).

There could then be a positive correlation between frequencies of non-synonymous and synonymous mutations in a particular gene (Mouchiroud et al. 1995). This would tend to undermine the value of the ratio of non-synonymous to synonymous mutations as evidence for positive selection. Indeed, quite often this index fails for the peptide-binding domains of polymorphic MHC genes, even though the frequency of non-synonymous mutations in peptide-binding domains is much greater than that in the conserved non-peptide-binding domain s (e.g. Ohta, 1995). In this circumstance, detection of decreased stem-loop potential, and some degree of intron conservation, could provide independent supporting evidence for positive selection.

Acknowledgement I thank L. Russell for technical help, J. Gerlach for assistance with computer configuration, J. Mau (NRC, Ottawa) for assistance with the Molecular Biology Data Service, and the Medical Research Council of Canada for support.


Andersson, G., Larhammer, D., Widmark, E., Servenius, B., Peterson, P. A. and Rask, L. Class II genes of the human major histocompatibility complex. Organization and evolutionary relationship of the DRb genes. J. Biol. Chem. 262: 8748-8758, 1987

Bjorkman, P. J. and Parham, P. Structure, function, and diversity of class I major histocompatibility complex molecules. Annu. Rev. Biochem . 59: 253-258, 1990

Breslauer, K. J., Frank, R., Blocker, H. and Marky, L. A. Predicting DNA duplex stability from base sequence comparison. Proc. Natl. Acad. Sci. USA 83: 3746-3750, 1986

Chorney, M. J., Sawada, I., Gillespie, G. A. Srivastava, R. Pan, J. and Weissman, S. M. Transcription analysis, physical mapping and molecular characterization of a non- classical human leukocyte antigen class I gene. Mol. Cell. Biol. 10: 243-253, 1990

Forsdyke, D. R. Early evolution of MHC polymorphism. J. theor. Biol . 150: 451-456, 1991(Click Here)

Forsdyke, D. R. Relationship of X chromosome dosage compensation to intracellular self/not-self discrimination: a resolution of Muller's paradox? J. theor. Biol. 167: 7-12, 1994

Forsdyke, D. R. A stem-loop kissing model for the initiation of recombination and the origin of introns. Mol. Biol. Evol . 12: 949-958, 1995a

Forsdyke, D. R. Conservation of stem-loop potential in introns of snake venom phospholipase A2 genes. An application of FORS-D analysis. Mol. Biol. Evol . 12: 1157-1165, 1995b

Forsdyke, D. R. Relative roles of primary sequence and (G+C)% in determining the hierarchy of frequencies of complementary trinucleotide pairs in DNAs of different species. J. Mol. Evol. 41: 573-581, 1995c

Forsdyke, D. R. Entropy-driven protein self-aggregation as the basis for self/not-self discrimination in the crowded cytosol. J. Biol. Sys . 3: 273-287, 1995d

Forsdyke, D. R. Reciprocal relationship between stem-loop potential and substitution density in retroviral quasispecies under positive Darwinian selection. J. Mol. Evol . 41: 1022-1037,1995d

Forsdyke, D. R. Different biological species "broadcast" their DNAs at different (G+C)% "wavelengths". J. theor. Biol . 178: 405-417,1996a

Girdlestone, J. Nucleotide sequence of an HLA-A1 gene. Nucleic Acids Res . 18: 6701, 1990

Goodman, M., Moore, G. W. and Matsuda, G. Darwinian evolution in the geneology of haemoglobin. Nature 253: 603-608, 1975

Hill, R. E. and Hastie, N. D. Accelerated evolution in the reactive centre regions of serine protease inhibitors. Nature 326: 96-99, 1987

Hughes, A. L. Origin and evolution of HLA class I pseudogenes. Mol. Biol. Evol . 12: 247-258, 1995

Hughes, A. L. and Hughes, M. K. Natural selection on the peptide-binding regions of major histocompatibility complex molecules. Immunogenetics 42: 233-243, 1995

Hughes, A. L. and Nei, M. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335: 167-170, 1988

Hughes, A. L. and Nei, M. Nucleotide substitution at major histocompatibility complex class II loci: evidence for overdominant selection. Proc. Natl. Acad. Sci. USA 86: 958- 962, 1989

Hyldig-Nielsen, J. J., Schenning, L., Hammerling, U., Widmark, E., Heldin, E., Lind, P., Servenius, B., Lund, T., Flavell, R., Lee, J. S., Trowsdale, J., Schreier, P. H., Zablitsky, F., Larhammer, D., Peterson, P. A. and Rask, L. The complete nucleotide sequence of the I-Ea d immune response gene. Nucleic Acids Res. 11: 5055-5071, 1983

Jaeger, J. A., Turner, J. H. and Zuker, M. Predicting optimal and suboptimal secondary structure for RNA. Meth. Enzymol . 183: 281-306, 1990

Klein, J. and O'hUigin, C. MHC polymorphism and parasites. Phil. Trans. R. Soc. Lond. B . 346: 351-358, 1994

Koller, B. H., Geraghty, D. E., Shimizu, Y., DeMars, R. and Orr, H. T. HLA-E, a novel class I gene expressed in resting lymphocytes. J. Immunol. 141: 897-904, 1988

Larhammer. D., Hammerling, U., Rask. L. and Peterson, P. A. Sequence of gene and cDNA encoding murine major histocompatibility complex class II gene Ab2. J. Biol. Chem . 260: 14111-14119, 1985

Le, S-Y. and Maizel, J. V. A method for assessing the statistical significance of RNA folding. J. Theor. Biol. 138: 495-510, 1989

Mallissen, M., Hunkapiller, T. and Hood, L. E. Nucleotide sequence of a light chain gene of the mouse I-A subregion: Ab d . Science 221: 750-754, 1983

Messer, G., Zemmour, J., Orr, H. T., Parham, P., Weiss, E. H. and Girdlestone, J. HLA- J, a second inactivated class I HLA gene related to HLA-G and HLA-A. J . Immunol. 148: 4043-4053,1992

Milner-White, E. J. Isoenzymes, isoproteins and introns. Trends Biochem. Sci . 9: 517- 519, 1984

Mouchiroud, D., Gautier, C. and Bernardi, G. Frequencies of synonymous substitutions in mammals are gene-specific and correlate with frequencies of non-synonymous substitutions. J. Mol. Evol . 40: 107-113, 1995

Nakashima, K-I., Ogawa, T., Oda, N., Hattori, M., Sakaki, Y., Kihara, H. and Ohno, M. Accelerated evolution of Trimeresurus flavoviridis venom gland phospholipase A2 isoenzymes. Proc. Natl. Acad. Sci. USA 90: 5964-5968, 1993

Nielsen, D. A., Novoradovsky, A. and Goldman, D. SSCP primer design based on single- strand DNA structure predicted by a DNA folding program. Nucleic Acids Res. 23: 2287-2291

Ohta, T. Further examples of evolution by gene duplication revealed through DNA sequence comparisons. Genetics 138: 1331-1337, 1994

Ohta, T. Gene conversion vs point mutation in generating variability at the antigen recognition site of major histocompatibility complex loci. J. Mol. Evol. 41: 115-119, 1995

Ohya, K., Kondo, K. and Mizuno, S. Polymorphism in the human class I MHC locus HLA-E in Japanese. Immunogenetics 32: 205-209, 1990

Radman, M., Wagner, R. and Kricker, M. C. Homologous DNA interactions in the evolution of gene and chromosome structure. Genome Anal . 7: 139-154, 1993

Salter, R. D., Benjamin, R. J., Wesley, P. K., Buxton, S. E., Garrett, T. P., Clayberger, C., Krensky, A. M., Norment, A. M., Littman, D. R. and Parham, P. A binding site for the T cell co-receptor CD8 on the a3 domain of HLA-A2. Nature 345:41-46, 1990

Schamboeck, A., Korman, A. J., Kamb, A. and Strominger, J. L. Organization of the transcriptional unit of a human class II histocompatibility antigen: HLA-DR heavy chain. Nucleic Acids Res. 11: 8663-8675, 1983

Schreier, T., Kedes, L. and Gahlmann, R. Cloning, structural analysis, and expression of the human slow twitch skeletal muscle/cardiac troponin C gene. J. Biol. Chem . 265: 21247-21253, 1990

Scpaer, E. G. and Mullins, J. I. Rate of amino acid change in the envelope protein correlates with pathogenicity of primate lentiviruses. J . Mol. Evol. 37: 57-65, 1993

Sinden, R. R. DNA Structure and Function . Academic Press, San Diego, 1994

Smith, N. H., Maynard Smith, J. and Spratt, B. G. Sequence evolution of the porB gene of Neisseria gonorrhoeae and Neisseria meningitidis: Evidence of positive Darwinian selection. Mol. Biol. Evol. 12: 363-370, 1995

Tanaka,T. and Nei, M. Positive Darwinian selection observed at the variable region genes of immunoglobulins. Mol. Biol. Evol . 6: 447-459, 1989

Zuker, M. (1989). Computer prediction of RNA secondary structure. Meth. Enzym . 180: 262-289.

Stem-Loop Potential in Retroviruses (Click Here)

Evidence from Overlapping Genes (Click Here)

Go: to Single Exon Database (Click Here)

Return to: Top of Page (Click Here)

Return to: Bioinformatics (Genomics) Index (Click Here)

Press release

Our knowledge regarding the genetic material, the genes, has increased dramatically during the last forty years due to achievements in the area of molecular biology. During the first decades, studies on simple organisms, in particular bacteria and bacterial viruses, dominated. A gene was conceived as a continuous segment within the very long double-stranded DNA molecules, the chemical substance of heredity. This simple picture of gene structure completely changed when Richard J. Roberts and Phillip A. Sharp in 1977 independently discovered that genes could be discontinuous, that is, a given gene could be present in the genetic material (DNA) as several, well-separated segments. As their experimental model system, both Roberts and Sharp used a common cold-causing virus, called adenovirus, whose genes display important similarities to those in higher organisms. Shortly thereafter it could be shown by several researchers that split genes are frequent in higher organisms, including man.

Roberts’ and Sharp’s discovery has changed our view on how genes in higher organisms develop during evolution. The discovery also led to the prediction of a new genetic process, namely that of splicing, which is essential for expressing the genetic information. The discovery of split genes has been of fundamental importance for today’s basic research in biology, as well as for more medically oriented research concerning the development of cancer and other diseases.

The genetic material

During the last forty years our knowledge of how the genetic material, the genes, governs the basic activities of life has increased dramatically. This is due to progress made within molecular biology, the area in science which explores biological phenomena and structures at the molecular level. Many of the most important discoveries within this area have been awarded a Nobel Prize. Examples include the discovery of how the nucleic acid DNA, the chemical substance of heredity, is built (1962), how the synthesis of nucleic acids takes place (1959), how the activity of genes is regulated (1965) and what the genetic code looks like (1968). This knowledge evolved primarily through studies of simple organisms such as bacteria and viruses infecting bacteria.

The general concept prevailing during the mid 1970s regarding the hereditary material and its function can be summarized as follows. A gene exists as a continuous stretch (segment) within a long, double-stranded DNA molecule. When the gene is activated, its information is copied into a single-stranded RNA molecule, called messenger RNA, which translates the information into a protein (figure 1A).

This simple picture of the sequence of events radically changed through the discovery made in 1977 by Richard J. Roberts, working at the Cold Spring Harbor Laboratory on Long Island, New York, and Phillip A. Sharp, working at the Massachusetts Institute of Technology in Cambridge, USA. They found that an individual gene can comprise not only one but several DNA segments separated by irrelevant DNA (figure 1B). Such discontinuous genes exist in organisms more complex than those studied earlier.

Figure 1: Gene structure and the flow of genetic information in bacteria (A) and higher organisms (B). In bacteria, the genetic information is stored as a continuous segment of DNA, and the messenger RNA can immediately direct the synthesis of the corresponding protein. In higher organisms, the gene is usually split, and the messenger RNA has to be processed by splicing before it can be translated into a protein.

How the discovery was made

Roberts and Sharp were studying the genetic material in adenovirus, a virus causing common cold. This virus infects the cells of higher organisms, and its genome has many properties resembling those of the host cell. At the same time, adenovirus has a simple structure, making it a very valuable experimental model for studying genes and their function in higher organisms. The genome of adenovirus consists of one single long DNA molecule. Roberts’ and Sharp’s aim was to determine where in the genome different genes were located.

In biochemical experiments it was shown that one end of an adenovirus messenger RNA did not behave as expected. One of several possible explanations was that the DNA segment corresponding to this end was not located in the immediate vicinity of the rest of the gene. To determine where this segment was located on the long DNA molecule, they used electron microscopy. They surprisingly found that a single RNA molecule corresponded to no less than four well-separated segments in the DNA molecule (figure 2). Roberts and Sharp came to the conclusion that the genetic information in the gene was discontinuously organized in the genome, a conclusion that contradicted the commonly held view regarding the structure of genes. The discovery immediately led to intensive research to find out whether this gene structure is present also in other viruses and in ordinary cells. Very soon after the initial discovery, several researchers could show that a discontinuous (or split) gene structure was common – and in fact the most common gene structure in higher organisms.

Figure 2: Schematic representation of the experiment that demonstrated that adenovirus DNA contains split genes. The genetic information in the messenger RNA resides in the DNA as four segments, which are separated by three intervening regions (a, b, and c). In the experimentally produced hybrid between one of the DNA strands and the RNA, the intervening sequences in the DNA strand appear as loops, i.e. the corresponding segments lack counterparts in the RNA. The hybrid could be directly visualized in the electron microscope.

The importance of the discovery

A gene may thus consist of several segments, usually termed exons separated by intervening DNA, termed introns. This knowledge has radically changed our view on how the genetic material has developed during the course of evolution. It has long been considered likely that evolution takes place as the result of the accumulation of minor alterations in the genetic material (mutations) resulting in a gradual change.

As a consequence of the discovery that genes are often split, it seems likely that higher organisms in addition to undergoing mutations may utilize another mechanism to speed up evolution: rearrangement (or shuffling) of gene segments to new functional units. This can take place in the germ cells through crossing-over during pairing of chromosomes. This hypothesis seems even more attractive following the discovery that individual exons in several cases correspond to building modules in proteins, so-called domains, to which specific functions can be attributed. An exon in the genome would thus correspond to a particular subfunction in the protein and the rearrangement of exons could result in a new combination of subfunctions in a protein. This kind of process could drive evolution considerably by rearranging modules with specific functions.

The discovery that genes can consist of two or more segments immediately led to a prediction with both surprising and important consequences. The first RNA product synthesized containing both exons and introns has to be “edited” such that the introns are cut out and the remaining exons are joined together to form a shortened RNA molecule. It has now been established that this process indeed takes place, and we have already accumulated detailed information on its nature. The process is called splicing and in higher organisms it represents an additional step in the transfer of information as compared to what usually occurs in lower organisms (figure 1B). The importance of splicing became particularly apparent, when it was found that it is not always the same segments that are identified as exons and are included in the final RNA molecule. In different tissues or developmental stages, the final RNA molecule may be different due to the utilization of alternative exon combinations. Thus, the same DNA region can in many cases determine the structure of many different proteins. The process is called alternative splicing and represents a fundamentally new principle: the genetic message, which gives rise to a particular product, is not definitely established at the stage when the RNA is first synthesized. Instead, it is the splicing pattern that determines the nature of the final product.

Medical aspects

Hereditary diseases are common – their estimated number is today no less than about 5000. Some of them are due to errors in the splicing process. The most studied of such diseases is beta-thalassemia, an anemia prevalent primarily in some Mediterranean countries.

The disease is due to a faulty protein, which forms part of hemoglobin in red blood cells. The protein is called beta-globin. If no or badly functioning beta-globin is made, the life-span of the red blood cells is shortened resulting in anemia. In different patients, small defects in the genetic material have been found, resulting in errors in the splicing process and thus in the synthesis of poorly functioning beta-globin. In the upper part of figure 3 the normal splicing of beta-globin RNA is shown (A). If the globin gene is damaged (marked by an arrow) it may, for example, lead to the formation of a larger than normal exon during splicing (B), or to the formation of a completely new exon (C).

Figure 3: Defective splicing causing beta-thalassemia. A normal beta-globin gene is presented in A, and two mutated genes that result in beta-thalassemia are shown in B and C. Arrows mark the position of point mutations. The interrupted lines denote the segments that are being joined during the splicing process. In the healthy individual, three segments are spliced as shown in A. In one of the thalassemia cases, an unusually long third segment is formed (B), while in the second one, an extra segment is produced (C).

Another example showing the connection between disease and the organisation of the genetic material into exons and introns is chronic myeloic leukemia, a type of cancer of the blood. Characteristic for this disease is the presence in tumor cells of a special, shortened chromosome, called the Philadelphia chromosome, named after the city in which it was discovered. This chromosome has arisen in a white blood cell by fusion of one end of chromosome 22 to one end of chromosome 9. At the break-point, a large portion of a cancer gene has been joined to another gene. Here we are thus dealing with two genes, which are now copied into one single RNA molecule. During the splicing process exons from the two genes are spliced to form an RNA molecule that specifies the synthesis of a new protein, a so-called fusion protein. This new protein gives rise to leukemia.

B. Alberts et al: Molecular Biology of the Cell. Garland, New York, 1989
P. Chambon: Split Genes. Scientific American 244, 60-71 (1981)
J.E. Darnell: RNA. Scientific American 253, 68-78 (1985)
J.E. Darnell et al: Molecular Cell Biology. Scientific American Books. Freeman, New York, 1990
E.H. McConkey: Human Genetics. The Molecular Revolution. Jones and Bartlett, Boston, 1993.

To cite this section
MLA style: Press release. Nobel Prize Outreach AB 2021. Mon. 28 Jun 2021. <>

Learn more

Nobel Prizes 2020

Twelve laureates were awarded a Nobel Prize in 2020, for achievements that have conferred the greatest benefit to humankind.

Their work and discoveries range from the formation of black holes and genetic scissors to efforts to combat hunger and develop new auction formats.

What is known about the function of introns, the nonencoding sequences in genes?

Ashok Bidwai, an assistant professor in the department of biology at West Virginia University, elaborates:

"It is widely believed that introns are remnants of genetic sequences that once served as spacers between the stretches of DNA that coded for specific, comparatively simple proteins. During the evolution of complex proteins, regions of the genetic code (known as domains) may have been shuffled and brought together to generate new sequences that code for novel protein structures that took on new functions. This hypothesis is based on the observation that the relative positions of introns in genes remain largely the same in organisms as diverse as Drosophila melanogaster (the fruit fly), Caenorhabditis elegans (a widely studied nematode), mice and humans. Walter Gilbert of Harvard University has laid out many of the details of this hypothesis.

"In addition, some researchers have proposed that introns serve as a mechanism that selects for genes that will be expressed early (rather than late) during the development of an organism. This idea is not based on extensive experimentation, however, so its plausibility is uncertain."

Sandro J. de Souza, who works in Walter Gilbert's laboratory at Harvard University, expands on the prevailing intron hypothesis:

"Questions about the function of introns appeared immediately after their discovery in 1977. What is the role of introns? Why are they here in our genes? Almost 20 years later we still do not have definitive answers, even though some DNA databases now contain around 500 megabases of sequences--that is, strings of genetic code that represent 500 million chemical letters of our genome.

"First, let's start with some classifications. There are at least five different types of introns. Some of them are ribozymes, RNA molecules that are catalytically active, meaning that they facilitate certain chemical reactions some of these ribozymes are able to perform a reaction in which they splice themselves out of the original transcript. The most common type of intron is called a spliceosomal or nuclear intron the name comes from the cellular machinery, known as the spliceosome, which is responsible for splicing and making sure that the genetic sequences in introns are not translated into junk proteins. This type of intron is the one found in the nuclear genes of humans.

"In general, nuclear introns are widespread in complex eukaryotes, or higher organisms. Simple prokaryotes and eukaryotes (such as fungi and protozoa) lack them. In complex multicellular organisms (such as plants and vertebrates), introns are about 10-fold longer than the exons, the active, coding parts of the genome. The sequence and length of introns vary rapidly over evolutionary time.

"Introns do sometimes have identifiable functions. Scientists have found clear examples of 'functional nuclear introns' that can accommodate sequences important for the expression of the gene on which the intron resides. This function is not a general feature of introns, however, because several genes that lack introns express themselves normally (histones and olfactory receptor genes, for instance). There are also cases in which introns contain genes for small nuclear RNA, which is important for the translation of messenger RNA, an intermediary between DNA and proteins. Nuclear introns can also be important in a process called alternative splicing, which can produce multiple types of messenger RNA from a single gene. Although these examples demonstrate a constructive role for introns, they cannot explain why introns are so ubiquitous in our genes.

"In 1978 Walter Gilbert of Harvard expressed a different view of the nature of introns (in the same report in which he coined the terms 'exon' and 'intron'). He suggested that introns could speed up evolution by promoting genetic recombinations between exons. This process (which he called 'exon shuffling') would be directly associated with formation of new genes. Introns, from this perspective, have a profound purpose. They serve as hot spots for recombination in the formation of new combinations of exons. In other words, they are in our genes because they have been used during evolution as a faster pathway to assemble new genes. Over the past 10 years, the exon shuffling idea has been supported by data from various experimental approaches.

"Several genome projects will be concluded in the next decade. They are expected to yield a huge amount of information about intron sequences. The new data should solve most of our basic questions about the functions of introns.

Summary – Group I vs Group II Introns

Group I and II introns are large ribozymes that catalyze a transesterification reaction to splice out introns from the primary transcript. They are found in all three domains. They both are mobile genetic elements. Moreover, they are used as tools in biotechnology and molecular medicine. However, the group I introns initiate splicing reaction by the nucleophilic attack of the 3′ OH of a guanosine cofactor at the 5P splice site. But, the group II introns initiate splicing reaction by the nucleophilic attack of the 2′ OH of the branch site adenosine on the 5′ splice junction. Moreover, group II introns form a lariat like structure during the splicing while group I introns do not form a lariat like structure. Thus, this is the summary of the difference between group I and group II introns.


1. Bonen, J. Vogel, et al. “Evolution of Group II Introns.” Mobile DNA, BioMed Central, 1 Jan. 1970, Available here.
2. Tourasse, Nicolas J, and Anne-Brit Kolstø. “Survey of Group I and Group II Introns in 29 Sequenced Genomes of the Bacillus Cereus Group: Insights into Their Spread and Evolution.” Nucleic Acids Research, Oxford University Press, Aug. 2008, Available here.

Image Courtesy:

1. “RF00028” By This image is taken from the Rfam database (Public Domain) via Commons Wikimedia
2. “IntronGroupII” By User:Sangak – Own work (Public Domain) via Commons Wikimedia

Watch the video: DNA Transcription Made EASY. Part 2: Elongation (October 2022).