Source of DNA sequences

I'm working on a project where I am taking DNA sequences and translating the codons into musical notes. I have some good ideas of how to do this, I'm just not sure what sequences to work with. My case study and a lot of my initial research showed that anything larger than a protein is too much to work with for now.

So my question: where can I find protein-coding DNA sequences in some standard format? I've looked at NCBI but I have no idea what I'm looking at or if I'm downloading the right stuff. Is there a link that I can go to that will have a listing of a bunch of different protein sequences that I can download?

GenBank and RefSeq have a huge collection of DNA sequences that can be downloaded in FASTA format. GenBank makes it very easy to search for sequences (for example, using the organism name) but will often have redundant data and varied types (everything from whole chromosome sequences to mRNAs to gene sequences to ESTs). RefSeq has much cleaner, non-redundant data but might take a little more effort for you to find the sequences you want. Whether that trade-off matters depends on your project. On the RefSeq page, under the FTP menu you can click Genomes to see which genomes have RefSeq data available. The FASTA sequences for a particular genome will be stored in a dedicated directory, possibly subdivided into further directories by chromosome. What you want are the FASTA files (ending in .fa, .fasta, .ffn, or something like that), keeping in mind they may be compressed (.fa.gz, .fasta.gz, etc).

Alternatively, if you already have a specific organism in mind, then you could do a Google search to see if the genome for that organism has been sequenced. If so, there is almost always a dedicated website where you can download DNA sequences for that genome.

A standard format would be the FASTA format.
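In case the format itself is unfamiliar: a FASTA file is plain text, where each record is a header line starting with ">" followed by one or more lines of sequence. Here is a minimal sketch of reading such a file and splitting a sequence into codons. The record below is made up for illustration and is not a real gene, and the function names are my own.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def codons(seq):
    """Split a sequence into complete triplets, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# Hypothetical record, for illustration only.
example = StringIO(">toy_record hypothetical example\nATGGCC\nTTAtga\n")
records = list(read_fasta(example))
print(records[0][0])
print(codons(records[0][1]))
```

For a real download you would pass an open file instead of the StringIO object. Compressed files can be opened in text mode with gzip.open(path, "rt") from the standard library.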

If you have a few proteins of interest that you would like to look at you can simply select 'Nucleotide' at the top of the NCBI page, enter your protein's name and then press the FASTA button below each result you are interested in.

If you just need a large dataset of DNA sequences, you could for example use the 'nt' BLAST database. (Note that 'nt' is NCBI's general nucleotide collection, so it is not limited to sequences that code for proteins; I suggest it mainly because I thought it would be easier to get.)

This is more of a comment, but too long to put in a comment box, so I'm putting it here.

This is a fun idea you're doing. I have a half baked idea (assuming you're looking for input) if you want to explore it further -- or not… by the time I finish writing this, I might realize it's too silly, but still… let's see.

It might be fun to take the sequence of two organisms, let's say mouse and human and align certain regions to each other -- imagine this is like playing a piano where the "left hand" might be the mouse sequence, and the "right hand" is human.

So, say you take a gene that is shared in both, like CCND1. You can align them against each other and you'll find large portions of the sequences are common (with some mismatches, obviously). In these regions, the left and right hands are playing together (different octaves).

You'll also find gaps in the alignments where you'll have a stretch of "mouse only" or "human only" sequence, and in these regions the left or right hand play alone (solo).

For instance, say the two alignments look like this:


In this case, you see stretches of the alignments where the mouse (left hand) will be playing a solo, and other times the two hands play in "harmony."
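To make the two-hands idea concrete, here is a rough sketch that walks a pairwise alignment column by column and decides which hand plays. The two aligned strings are invented for illustration (they are not real mouse or human CCND1 sequence), and the label names are just my own choice.

```python
def hands(mouse_aln, human_aln):
    """Label each alignment column by which 'hand' plays.

    Both inputs are rows of a pairwise alignment, with '-' marking gaps.
    """
    if len(mouse_aln) != len(human_aln):
        raise ValueError("aligned rows must have equal length")
    labels = []
    for m, h in zip(mouse_aln, human_aln):
        if m == "-" and h == "-":
            labels.append("rest")
        elif h == "-":
            labels.append("mouse solo")   # left hand only
        elif m == "-":
            labels.append("human solo")   # right hand only
        elif m == h:
            labels.append("together")
        else:
            labels.append("mismatch")
    return labels

# Hypothetical toy alignment, for illustration only.
mouse = "ATGCCGTA--GC"
human = "ATG---TACCGC"
labels = hands(mouse, human)
for column in zip(mouse, human, labels):
    print(column)
```

How a mismatch should sound (a dissonance, maybe) is a musical decision, so the sketch only flags it.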

Lecture 17: Genomes and DNA Sequencing

Professor Martin talks about DNA sequencing and why it is helpful to know the DNA sequence, followed by linkage mapping and then the different methods of sequencing DNA.

Instructor: Adam Martin

PROFESSOR: And today I'm going to talk about DNA sequencing. And I want to start by just sort of illustrating an example of how knowing the DNA sequence can be helpful. So you remember in the last lecture, we talked about how one might identify a gene through functional complementation. And this process involved making a DNA library that had different fragments of DNA cloned into different plasmids and then involved finding the needle in the haystack where you find the gene that can rescue a defect in a mutant that you have.

So if this line that I'm drawing here is genomic DNA, and it could be genomic DNA from, let's say, a prototroph for LEU2, the leucine gene. So this is from a prototroph.

Then you could cut up the DNA with EcoRI. And if there is not a restriction site in this LEU2 gene, you get a fragment that contains the LEU2 gene. And then you could clone this into some type of plasmid that replicates in the organism that you're introducing it and propagating it in.

And so that would allow you to then test whether or not this piece of DNA that you have complements a LEU2 auxotroph, OK?

Now one thing I want to point out is that because these EcoRI sites, these sticky ends, would recognize this EcoRI end or this EcoRI end, you can imagine that this gene-- if the gene reads this way to this way-- it could insert this way into the plasmid. Or it could insert in the opposite direction. So it could be inverted. So this would have some sort of origin of replication and some type of selectable marker.

But if you have the same restriction site it can insert one way or the opposite way. That's just one thing I wanted to point out.
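The reason a single enzyme allows both orientations can be checked in a few lines. The EcoRI recognition sequence GAATTC reads the same on both strands, so a fragment flanked by two EcoRI sites presents the same ends after it is flipped. This is a minimal sketch; the insert sequence is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

ECORI_SITE = "GAATTC"

# The site is its own reverse complement (a palindrome in the DNA sense).
print(reverse_complement(ECORI_SITE) == ECORI_SITE)

# Hypothetical insert flanked by EcoRI sites, for illustration only.
insert = ECORI_SITE + "ATGAAACCC" + ECORI_SITE
flipped = reverse_complement(insert)

# The flipped fragment still begins and ends with the same site,
# so the vector cannot tell the two orientations apart.
print(flipped)
```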

Now let's say rather than leucine, you're interested in cyclin-dependent kinase, and you had a mutant in CDK and you had the sequence of your yeast CDK gene. Well, rather than having to dig through a whole library of pieces of DNA for the CDK gene, basically fishing for that needle in the haystack, if you knew the sequence of the human genome, you'd be able to identify similar genes by sequence homology.

And you could then take a more direct approach, where you take-- let's say you have a piece of human DNA now, double stranded DNA, and it has the CDK gene. You could take human DNA with this CDK gene. And you have unique sequence around the CDK gene, which would allow you to denature this DNA. And if you denature the DNA, you'd get two single strands of DNA. And you could then design primers that recognize unique sequences flanking the CDK gene.

So you could imagine you'd have a primer here and a primer here. And then you could use PCR to amplify specifically CDK gene from, it could be the genome or from some library. And then you get this fragment here, which includes CDK. So knowing the sequence of the genome would allow you to more rapidly go from maybe a gene that you've identified as being important in one organism, and find the human equivalent that might be doing something similar in humans. So this step here is basically PCR.
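As a rough illustration of primers flanking a gene, the sketch below performs an idealized "in silico PCR" on a single top-strand string. It assumes exact primer matches and ignores melting temperature, mispriming, and everything else that matters at the bench. The template and primer sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def amplify(template, fwd_primer, rev_primer):
    """Return the top-strand region bounded by the two primers, or None.

    The forward primer matches the top strand directly. The reverse primer
    anneals to the top strand, so its reverse complement is searched for.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]

# Hypothetical template: flank + forward site + "gene" + reverse site + flank.
template = "TTTT" + "ACGTAC" + "ATGAAACCCGGG" + "CTAGCA" + "AAAA"
product = amplify(template, "ACGTAC", "TGCTAG")
print(product)
```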

And let's say the CDK gene had restriction sites. Let's see, we'll say restriction site K and A here. Then if you have these restriction sites in your fragment of DNA, you can then digest or cut that piece of DNA with these restriction endonucleases. And then you'd get a fragment of CDK that has K and A sticky ends. We'll pretend that both of these have sticky ends.

And now you have unique sticky ends between K and A. And you might have a vector that also has these two sites. And you could digest this vector with these two enzymes. And that would allow you to insert the specific gene in this plasmid.

And if you have two unique sites, because K only recognizes K here and A only recognizes A, then it will ligate in. But you can do it with a specific orientation because you have two different restriction sites. So I hope you all see how it's different with one restriction site versus two.

All right. Now let's say you want to do something more complicated than this. Let's say rather than just identifying the gene that's involved in cell division, you want to engineer a new gene, in order to determine where this particular protein, CDK, localizes in the cell. So we have CDK, which could be from yeast or human, it doesn't matter. And you want to engineer a new protein, basically, that you can see.

So remember Professor Imperiali introduced green fluorescent protein earlier in the year. And this green fluorescent protein is from a gene from jellyfish. So now we could, using what I've told you, reconstruct or engineer a gene that has DNA from three different organisms, in order to make a CDK variant that we are able to see in the cell.

So remember, a green fluorescent protein is like a beacon, if it's attached to a protein. If you shine blue light on it, it emits green light. And so you can use a fluorescent microscope in order to see it.

In this case, let's say there's also another restriction site here, R. And let's say you have a fragment of GFP that has two restriction sites, A and R. You could then cut this fragment and this fragment with these restriction enzymes A and R. And you could insert GFP at the C terminus of the CDK gene. So you could go and have a gene that has CDK GFP inserted inside a bacterial vector.

Now which one of these junction sites do you think would be most sensitive in doing this type of experiment? So there are three junction sites. There's this one, this one, and this one. Which is the one you're probably going to put the most thought into when you're doing this experiment?

ADAM MARTIN: The A site. Miles is exactly right. This one is going to be important. And why did you choose that site?

AUDIENCE: Of the three sites, two are half insert, half originals [INAUDIBLE]. But at A, both sides of it are inserts. So [INAUDIBLE] carefully.

ADAM MARTIN: And if you're trying to make a fusion protein, what's going to be an important quality of this? Malik, did you have a point?

AUDIENCE: Well, they try to [INAUDIBLE] we'd have to make sure that the [INAUDIBLE].

ADAM MARTIN: Excellent job. So Malik just pointed out two really important things. To make this a fusion protein, you have two different open reading frames. These two open reading frames have to be in frame with each other.

So this junction here has to be in frame where GFP is in frame with CDK, meaning that you're reading the same triplet codons for GFP, they're in the same frame as CDK. Also, you want to make sure there's no stop codon here. Because if you had a stop codon here, you're just going to make a CDK protein. And then it's going to stop and then you won't have it fused to GFP.
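The two conditions just described, same reading frame and no stop codon before the junction, can be written as a small check. This is a sketch only: the sequences are short made-up stand-ins rather than real CDK or GFP, and the junction sequence is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def fusion_problems(upstream_orf, junction):
    """List reasons a downstream ORF would not be expressed as a fusion.

    upstream_orf starts at its ATG; junction is whatever sequence sits
    between the upstream ORF and the first codon of the downstream ORF.
    """
    problems = []
    upstream = upstream_orf + junction
    if len(upstream) % 3 != 0:
        problems.append("downstream ORF is out of frame")
    triplets = [upstream[i:i + 3] for i in range(0, len(upstream) - 2, 3)]
    if any(t in STOP_CODONS for t in triplets):
        problems.append("stop codon before the junction")
    return problems

# Hypothetical toy sequences, for illustration only.
print(fusion_problems("ATGGCTAAA", "GGATCC"))      # in frame, no stop
print(fusion_problems("ATGGCTAAATAA", "GGATCC"))   # upstream stop kept
print(fusion_problems("ATGGCTAAA", "GGATC"))       # one base short
```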

And you guys will work through more of these in the homework. So you'll be able to get a sense of it.

So now for the remainder of this lecture and also for Monday's lecture, I want to go through a problem with you. Basically, if you have a given disease that's heritable, how might you go from knowing that disease is heritable to finding out what gene is responsible for that given disease? And this is going to involve thinking about different levels of resolution, in terms of maps.

So the highest resolution map you can have for a genome is the sequence. You can have the full nucleotide sequence of a genome. And that's the highest possible resolution because you have single nucleotide resolution as to what every single base pair is. But that's like knowing like your apartment number and your street number and basically knowing everything. But starting out, you might want to know what continent it's on, or what country is it in.

And so you first have to narrow down the possible locations for a given disease gene. And that will, at first, involve establishing what chromosome and what region of a chromosome a given disease allele is linked to. And that involves making essentially a linkage map, where you establish where a disease gene is located based on its linkage to known markers that are present in the genome.

Now this is going to require that you remember back two weeks ago, to when we talked about linkage and recombination. And you'll recall that we were looking at the linkage between genes in flies and genes in yeast. One difference between that type of linkage mapping and human linkage mapping is we don't have really clear traits that are defined by single genes. You can't just take hair color and map the hair color gene to link it to a disease gene. Because hair color is determined by many, many different genes.

So in fruit flies, you can take white eyes and see if it's connected with yellow body color because both of those are determined by single genes. So we need something other than just having phenotypic traits that we can track. We need what are known as molecular markers to be able to perform linkage mapping.

And so what we need in these molecular markers-- well, if we just think about if we wanted to determine the linkage between the A and B genes. And if you did this cross, would you be able to determine linkage?

Georgia, you made a motion that was correct. Tell me. Why did you shake your head no?

AUDIENCE: They'd all be heterozygous.

ADAM MARTIN: Yeah they'd all be heterozygous. Because this individual has the same allele on both chromosomes, you're not going to be able to differentiate one chromosome from the other. And so the point I want to make is that in order to see linkage, what you need is variation.

So we need to have variation. And another term for genetic variation is polymorphism. So we need polymorphism, or genetic variation, between these molecular markers.

We also need genetic variation in the disease. But we have that. We have individuals that are affected by a disease and individuals that are not affected by a disease. So we have variation in alleles there. But in order to map it with a molecular marker, to map linkage to a molecular marker, you also need variation here. So the problem with this cross is that here you need to have a heterozygote. There needs to be variation in this individual, where both of these loci are heterozygous.

So now I want to talk about some of these molecular markers that we can use, and how they vary between individuals and between chromosomes. Now this is going to be maybe the lowest resolution map. But I'm talking about this linkage map here. And you can see highlighted at the bottom here are various types of polymorphisms that we can use to link a given disease allele to a specific chromosome and a specific place on the chromosome.

So I'll start with the first one, which is a simple sequence repeat. It goes by many names. But I will stick with what's on the slide.

So a simple sequence repeat is also known as a microsatellite. So you might see that term floating around, if you're reading about this. And what a simple sequence repeat is, as the name implies, it's a simple sequence. It could be a dinucleotide, like CA. And it's just a dinucleotide that's repeated over and over again.

So on a chromosome, you might have a unique sequence, which I'll just draw as a line. And then you could have a CA dinucleotide that's repeated some number of times, N. And then that's followed by another unique sequence. And that's what's present in it.

So that would be one strand. And then in the opposite strand, you'd have a unique sequence, the complement of CA, which is GT, and then, again, unique sequence. And so there's variation in the number of repeats of the CA. And so there's polymorphism. So we can use this to establish linkage between this marker and a phenotype, like a disease phenotype.

So how might you detect the number of repeats that are present here? Anyone have an idea of a tool that we've discussed that could be used here? So one hint that I gave you is that the sequence here is unique and the sequence here is unique. So is there a way we can leverage that unique sequence to determine whether there's a difference in the number of repeats?

What's a technique we discussed that involves some component of the technique recognizing a unique sequence? Yeah, Natalie?

ADAM MARTIN: Well, CRISPR Cas9 is a possibility. Jeremy, did you have an idea?

ADAM MARTIN: PCR-- so it's true. You could get it to recognize that. But then you have to detect it, somehow. So what's more commonly used is PCR. Those are both good ideas. But using PCR, you could design a primer here and a primer here. And you could amplify this repeat sequence. And the number of repeats would determine the size of your PCR fragment.

So if you did PCR, then you'd get a PCR fragment that has the primers on each end, but then has this certain size based on the number of repeats. So in that case, we need some sort of tool that enables us to determine the size of a particular DNA fragment. And so I'm going to just introduce to you one such tool, which is gel electrophoresis.
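The relationship between repeat number and PCR product size is simple arithmetic: the two flanking regions contribute a fixed length, and each extra CA repeat adds two base pairs. A minimal sketch, with made-up flanking sequences standing in for the unique regions that the primers would recognize:

```python
# Hypothetical unique flanking sequences (primer sites included), 20 bp each.
LEFT_FLANK = "ACGTTGCAAGGCTTAGCCTA"
RIGHT_FLANK = "TTGACCGGATAGCTAGGCAT"

def microsatellite_amplicon(n_repeats, unit="CA"):
    """Build the amplified region for an allele with n_repeats copies of unit."""
    return LEFT_FLANK + unit * n_repeats + RIGHT_FLANK

# Three hypothetical alleles that differ only in repeat number.
for n in (12, 15, 20):
    print(n, "repeats ->", len(microsatellite_amplicon(n)), "bp")
```

Alleles that differ by a few repeats therefore differ by only a few base pairs, which is why the fragments have to be sized on a gel.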

And gel electrophoresis involves taking DNA that you've generated, by either PCR or by cutting up DNA with a restriction enzyme, and loading it in a gel. The gel might be composed of agarose, or it could be composed of polyacrylamide. And then because the DNA backbone is negatively charged, if you run a current through it, such that the positive electrode is at the bottom, then the DNA is going to snake through this gel toward the positive electrode.

Now we'll do a quick demonstration, if you two could come up. I need one volunteer. Ori, find 10 of your friends and bring them down. All right. That's probably good. Yeah.

All right, Hannah, why don't you-- you guys have to link up, OK? Stay over here. We'll start at this end. This is the negative electrode over here. The positive electrode is going to be down there. And Jackie is going to be our single nucleotide. You guys link like-- yeah. You don't have to do-si-do, or anything like that.

All right. Now what I want you guys to do is I want you to slalom through these cones like it's an agarose gel. So that you're going towards the other side. And I'm going to turn on the current now. So go. All right, stop.

All right. See how the shorter DNA fragment is able to more easily navigate through the cones and get farther. So it was somewhat rigged. I know. But I just needed some way to make sure you always remember that the shorter nucleotide, or the shorter fragment, is going to migrate faster.

You guys can go back up. Thank you for your participation. Let's give them a round of applause.

All right. So what you just saw is that the longer DNA fragments, they're going to be more inhibited by moving through the gel. And so they're going to move slower and thus, not move as far in the gel. Whereas, the small fragments are going to move much faster because they're able to maneuver their way through this gel much more quickly. So there's going to be an inverse relationship between the size of the DNA chain and its rate of movement. You're always going to see the shorter DNA fragment moving faster.

So what one of these gels actually looks like is shown here. So this is a DNA gel that's agarose. And DNA has been run in these different samples. And what you're seeing is this gel is subsequently stained with a dye, like ethidium bromide, which allows you to visualize the individual DNA fragments. And so a band on this gel indicates a whole bunch of DNA fragments that are all roughly the same length. So essentially, you can measure DNA length using this technique.

What's over here at the end of the gel, this is probably some sort of DNA ladder, where you have DNA fragments of known length that you can use to calibrate the length of these bands over here. So this is how you measure DNA length. And we're going to use it over and over again, as we talk about DNA and sequencing.
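Calibrating against a ladder can be sketched as interpolation between bands of known size. The sketch below assumes that, over the range of the ladder, the logarithm of fragment length falls roughly linearly with migration distance. That is a common working approximation and not something stated in the lecture, and the ladder values are invented for illustration.

```python
import math

def estimate_size(distance, ladder):
    """Estimate fragment length (bp) from migration distance.

    ladder is a list of (size_bp, distance_mm) pairs for bands of known
    length. Interpolation is linear in log10(size), which is an assumption.
    """
    points = sorted(ladder, key=lambda band: band[1])  # order by distance
    for (s1, d1), (s2, d2) in zip(points, points[1:]):
        if d1 <= distance <= d2:
            frac = (distance - d1) / (d2 - d1)
            log_size = math.log10(s1) + frac * (math.log10(s2) - math.log10(s1))
            return 10 ** log_size
    raise ValueError("distance falls outside the ladder")

# Hypothetical ladder: larger fragments travel a shorter distance.
LADDER = [(1000, 10.0), (500, 20.0), (250, 30.0), (125, 40.0)]
print(round(estimate_size(25.0, LADDER)))
```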

So now, let's think about how this is going to help us establish linkage between a particular marker in the genome and a genetic disease. So if we think about these microsatellite repeats, I told you they're polymorphic. They exhibit a lot of variation in size. And so here's an example showing you a female who has two intermediate sized microsatellites. And if you look at this-- if you did PCR and measured the size of these, you get two different bands because there are two different alleles of different length here.

So you can see this individual has two intermediate length repeats. And this person has had children with an individual that has a short and a long microsatellite. And you can see that on the gel, here.

Now this female is affected by some disease. And these two individuals have children. And you can see that a number of those children are affected by the disease. So what mode of inheritance does this look like? If you had your choice between autosomal recessive, autosomal dominant, sex linked dominant, and sex linked recessive, what mode of inheritance is this looking like? Oh, Carmen.

AUDIENCE: Autosomal recessive.

ADAM MARTIN: Autosomal recessive? Why do you go with recessive? Yeah, go ahead.

AUDIENCE: Because there is a male that's affected. But not both of the parents are affected. So it seems like the father is heterozygous and the mother is homozygous recessive.

ADAM MARTIN: That's possible. That's exactly the logic I want to see. Is there another possibility? Yeah, Jeremy.

AUDIENCE: Autosomal dominant.

ADAM MARTIN: It could also be autosomal dominant. So you're right. You're right. If this was not a rare disease, then that male could be a carrier and could be passing it on to half the children. So that's good. You'd essentially need more information to differentiate between autosomal recessive and autosomal dominant.

For the purposes of this, we're going to go with autosomal dominant. And what you see is that you want to look at the affected individuals and see if the disease phenotype is linked, or connected, with one of these microsatellite alleles. So if we look at-- we basically PCR DNA from all these individuals. And if you look at who is affected, each one of the individuals has this M double prime band. And none of the unaffected individuals has it.

So obviously, it would be better to have more pedigrees and more data to really establish significance between this linkage. But this is just a simple example, showing what you could possibly see if you have one of these molecular markers linked to a particular disease allele. So that kind of establishes the principle.
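The reasoning used on the gel, finding a marker allele that every affected individual carries and no unaffected individual carries, can be written as a set operation. The family below is hypothetical and only mimics the pattern described, with allele names M' and M'' for the mother's two intermediate bands and S and L for the father's short and long bands.

```python
def alleles_tracking_disease(family):
    """Return marker alleles present in all affected and no unaffected people.

    family is a list of (marker_alleles, affected) pairs, where
    marker_alleles is a set of allele names seen as bands on the gel.
    """
    affected = [alleles for alleles, sick in family if sick]
    unaffected = [alleles for alleles, sick in family if not sick]
    if not affected:
        return set()
    shared = set.intersection(*affected)
    seen_in_unaffected = set().union(*unaffected)
    return shared - seen_in_unaffected

# Hypothetical pedigree data, for illustration only.
family = [
    ({"M'", "M''"}, True),   # mother, affected
    ({"S", "L"}, False),     # father, unaffected
    ({"M''", "S"}, True),    # children
    ({"M'", "L"}, False),
    ({"M''", "L"}, True),
    ({"M'", "S"}, False),
]
print(alleles_tracking_disease(family))
```

With a family this small the pattern could arise by chance, which is why more pedigrees are needed before calling it linkage.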

Now let's think about what are some other molecular markers that are possible? So another type of marker, and this is one that's the most common one, if I go here. So here, you see here is a linkage map. And you see most of these bands are green. And the green markers, here, are what are known as Single Nucleotide Polymorphisms, or SNPs.

So single nucleotide polymorphisms-- and this is abbreviated SNP. And what a single nucleotide polymorphism is, is it's a variation of a nucleotide at a single position in the genome. So it's just a one base pair difference at a given position. And because that's a pretty general definition, there are tons of these in the genome.

Now one thing to think about is you could have a mutation in an individual that creates a SNP. So you could have a de novo formation of a SNP. But if you have a SNP and it gets incorporated to the gametes of an individual, then that variant is going to be passed on to the next generation. So this is something that could occur de novo. But it is also heritable. And if it's heritable, then you can follow it and use it to determine if a given variant is linked to a given phenotype, like a disease.

So to identify a single nucleotide polymorphism, it's helpful to be able to sequence the DNA. And I'll talk about how we could do that in just a minute. But before I go on, I just want to point out a subclass of SNPs that can be visualized without sequencing. And these are called restriction fragment length polymorphisms.

So restriction fragment-- so it's going to involve some type of restriction enzyme digest length polymorphism. It's a long word. But it's abbreviated RFLP. And what this is, is it's a variation of a single nucleotide. But this is a subclass of SNP. Because this is when the variation occurs in a restriction site for a restriction enzyme.

So if you remember your good friend EcoRI, EcoRI recognizes the nucleotide sequence GAATTC. And EcoRI only cleaves DNA sequence that has GAATTC. So if there was a single nucleotide variation in the sequence, such that it's now GATTTC, or something like that, that destroys the EcoRI site. And so EcoRI will no longer be able to recognize this site in the genome and cut it.

So you could imagine that if one individual had three EcoRI sites in this region of the genome, if you digest this region, you'd get two fragments. But if you destroyed the one in the middle, then if you digested this piece of DNA, you'd only get one fragment. And because it results in different sizes of fragments, that's something you can see just by doing DNA electrophoresis. And maybe you would use some method to detect this specific region, so that you're not looking at all the DNA in the genome, but you're establishing linkage to this specific area.

You could use PCR. You can have PCR primers here and here. And you could then cut with EcoRI. In one case, where the site is intact, you'd get two fragments. In the other case, where the site is destroyed, if you amplified this region of the genome and cut with EcoRI, you'd only get one fragment. So you'd be able to differentiate between those possibilities.
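The PCR-then-digest comparison can be mimicked on strings. The sketch below cuts a top-strand sequence wherever GAATTC occurs (EcoRI cuts between the G and the first A) and reports fragment lengths. Both amplicons are hypothetical, and they differ by the single base change GAATTC to GATTTC.

```python
ECORI_SITE = "GAATTC"

def digest(seq, site=ECORI_SITE, cut_offset=1):
    """Cut seq at every occurrence of site, cut_offset bases into the site.

    Only the top strand is modeled, so the sticky-end overhang is ignored.
    """
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

# Hypothetical amplicons, for illustration only.
with_site = "ACGTACGT" + "GAATTC" + "TTGGCCAA"
without_site = "ACGTACGT" + "GATTTC" + "TTGGCCAA"

print([len(f) for f in digest(with_site)])      # two bands on the gel
print([len(f) for f in digest(without_site)])   # one band on the gel
```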

AUDIENCE: When you use PCR, are there [INAUDIBLE]?


ADAM MARTIN: Oh. You're saying what causes it to stop? That's a great question, Malik. Yeah.

So initially, it's not going to stop. That's absolutely right. But because every step, each time you replicate, it's then primed with another primer. So you'd replicate something like this that's too long. But then the reverse primer would replicate like this. And it would stop.

So if you go back to my slide from last lecture, look through that and see if it makes sense how it's ending. Because if you do this 30 times, you really will enrich for a fragment that stops and ends at the two primers, or begins and ends at the two primers, I should say. Good question. Thank you.
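The enrichment argument can be checked by counting strands cycle by cycle. The sketch assumes an idealized reaction in which every strand is copied once per cycle, starting from one double-stranded template. "Long" strands are copies of an original strand, which begin at a primer but run past the far end. "Short" strands are copies of long or short strands, which begin at one primer and end at the other.

```python
def pcr_strand_counts(cycles):
    """Return (original, long, short) strand counts after a number of cycles."""
    original, long_strands, short_strands = 2, 0, 0
    for _ in range(cycles):
        # Each original strand templates one open-ended (long) copy.
        new_long = original
        # Copying a long or short strand stops at the template's 5' end,
        # which is a primer, so the product has both ends defined.
        new_short = long_strands + short_strands
        long_strands += new_long
        short_strands += new_short
    return original, long_strands, short_strands

for n in (1, 2, 3, 10, 30):
    original, long_strands, short_strands = pcr_strand_counts(n)
    total = original + long_strands + short_strands
    print(n, long_strands, short_strands, round(short_strands / total, 6))
```

The long strands only grow linearly (two per cycle), while the primer-bounded strands grow as 2^(n+1) - 2n - 2, so by 30 cycles essentially everything in the tube begins and ends at the two primers.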

All right. Now, let's talk about DNA sequencing. Because as I showed you, obviously, these SNPs, because there are so many of them, are probably the most useful of these markers to narrow in on where your disease gene is. And to detect a SNP, we need to be able to sequence DNA.

So I'm going to start with an older method for DNA sequencing, which conceptually, is very similar to how we do DNA sequencing today. And so it will illustrate my point. And then at the end, I'll talk about more modern techniques to sequencing.

So the technique I'm going to introduce to you is called Sanger sequencing. And that's because it was developed by an individual named Fred Sanger. And I'm going to just take a very simple DNA sequence, in order to illustrate how Sanger sequencing works.

So let's take a sequence that's really simple. This is very, very simple, and then more sequence here. So let's say we want to determine the nucleotide that's at every position of this DNA fragment. So one way we could maybe conceptually think about doing this, is to try to let DNA polymerase tell us where given nucleotides are. And if we're going to use DNA polymerase, what are we going to need, in order to facilitate this process? Yes, Rachel.

ADAM MARTIN: You're going to need nucleotides, definitely. So we're going to need nucleotides. What else? To start, what are you going to need? Miles?

ADAM MARTIN: You're going to need a primer, exactly. Good job. So you need a primer. So here's a primer.

And now, we're going to try to get DNA polymerase to tell us whenever there is a given nucleotide in this DNA sequence. And so think with me. Let's say we were able to get DNA polymerase to stop whenever there was a certain nucleotide.

So if we go through just a couple nucleotides, let's say, at first, we want DNA polymerase to stop whenever there's an A. So let's say there was a possibility it would stop at this A. If it's stopped at this A, you'd generate a fragment of this length. But if it read on through that A, there's another possibility that it would stop at this A.

So we're kind of looking at when these are stopping. And the final possibility is it goes on and stops at this A. So if this DNA polymerase stopped only at As, you'd get fragments that are these three discrete lengths.

Now let's consider another possibility. So pink here is stop at A. And in blue, I'm going to draw what would happen if it stopped at T. So they all start from the same place. If it stopped at T, it would just stop one nucleotide beyond this A in this simple sequence. So in blue here, this is stop at T.

But if it's just a possibility, it stops. And some of the polymerases could go beyond this T and go to the next T and stop here. And again, this would be one nucleotide length longer than this pink one, here. And the final one would-- I'll just draw it down here-- would get out to this last T, here.

So what you see is if we could get DNA polymerase to stop at these discrete positions, we'd get different sized fragments, whether it was stopping at one nucleotide versus the other nucleotide. You all see how this is resulting in different fragment lengths. Yes, Andrew.

AUDIENCE: How would you create a pattern [INAUDIBLE]?

ADAM MARTIN: There are companies now. You can basically take nucleotides and synthesize these primers chemically, not using DNA polymerase.

AUDIENCE: I'm saying how would you know what primer to use, if you don't know the sequence?

ADAM MARTIN: Oh, in this case, you'd have to start with some sequence that you know. So in most sequencing technologies, you kind of make a DNA library, where you know the sequence of the vector. And then you'd use the vector sequence as a primer to sequence into the unknown sequence. Great question. Good job.

All right. So what we need now then is some sort of tool or ability to stop DNA polymerase when there's a certain nucleotide base. And to do that, we can use this type of molecule, here, which is known as a dideoxynucleotide. Remember, for DNA polymerase to elongate a chain, it requires that the last base have a three prime hydroxyl.

And so what this dideoxynucleoside triphosphate is, is it's a nucleoside triphosphate that lacks a three prime hydroxyl. Here, I'll highlight that.

So you see this guy? You see the bold, highlighted H? There's a hydrogen there on the three prime carbon, rather than the normal hydroxyl group. So if this base gets incorporated into an elongating chain, DNA polymerase is not going to be able to move on.

So this method where you can add a certain dideoxynucleoside triphosphate to stop chain elongation is known as a chain termination method. So you're getting chain termination. And you're getting this chain termination with a specific dideoxynucleoside triphosphate. So these dideoxynucleoside triphosphates, if they get incorporated into the DNA, are going to halt the synthesis of that DNA strand.

So if we take our example, here, this might be a reaction that has dideoxythymidine triphosphate. So if we had dideoxythymidine triphosphate in this sample and it's elongating, then when the polymerase reaches this point, there's a possibility that it will incorporate the dideoxynucleoside triphosphate. And if this is a dideoxynucleoside triphosphate, then there won't be a three prime hydroxyl.

And DNA polymerase will just be like, oh, I can't go on! Because it's not going to have a three prime hydroxyl. So it's not going to be able to continue with the next nucleotide. So this is known as chain termination.

So let me take you through an example, here. All right. So here's an example that you have a slide of. And again, there's a template strand, which is the top strand. And this method requires that you have a primer. And what's often done is you label the primer.

So the first step is you have to denature your DNA. So you have to go from double stranded DNA to single stranded DNA. And then you mix the DNA with, first, this labeled primer, such that the primer can then anneal to the single stranded DNA. You need DNA polymerase, as I've mentioned. And as, I believe, Rachel mentioned before, you need the building blocks of DNA. So you need the four deoxynucleoside triphosphates.

So you always have the four deoxynucleoside triphosphates. But what's special here is you're going to spike each of several reactions with one of the dideoxynucleoside triphosphates. So you spike the reaction with a tiny amount of one of your dideoxynucleoside triphosphates.

So let's say you have a reaction, here. And this one here has dideoxyadenosine triphosphate. Then polymerase will elongate this strand until there's a thymidine on the template. And then there's a possibility that it will incorporate this dideoxy NTP. And if it does, then you get chain termination. And you get a fragment of this length.

But the other possibility, because there is still the deoxy form of the NTP present, it's possible that it incorporates a deoxyadenosine triphosphate there. And keeps going, and then incorporates a dideoxy ATP later on, where you have another T. And so the polymerase will essentially randomly stop at these different thymidine residues, depending on whether or not a dideoxynucleoside triphosphate is incorporated. And that means for a given reaction, one in which you have dideoxy ATP, you get a certain pattern of bands that represent the length of fragments, where you have, in this case, a thymidine base.

And then you do this for all four bases, where you have four reactions, each with a different base that's dideoxy. So when you're adding these, you're going to do four reactions, one with dideoxy ATP spiked in, one with dideoxy TTP, one with dideoxy CTP, and the last with dideoxy GTP. And because these nucleotides are present in different positions along the sequence, you're going to get a distinct banding pattern for each of these reactions. But using that banding pattern, you can then read off the sequence of DNA that's present on the template strand.
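The four-reaction logic described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the template `TACGGTCA` is made up, the function names are illustrative, fragment lengths are counted from the first base added after the primer, and strand orientation is ignored.

```python
# Toy model of the four-reaction chain termination experiment.
# The template and the function names are illustrative assumptions.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesized_strand(template):
    """Strand the polymerase builds, in the order bases are added.

    Strand orientation (5'/3') is ignored in this toy model: position i of
    the new strand simply pairs with position i of the template.
    """
    return "".join(COMPLEMENT[base] for base in template)


def fragment_lengths(template, dd_base):
    """Every length at which a reaction spiked with dd_base can terminate."""
    new_strand = synthesized_strand(template)
    return [i + 1 for i, base in enumerate(new_strand) if base == dd_base]


def read_gel(lanes):
    """Read the new strand from the shortest band to the longest."""
    band_to_base = {}
    for base, lengths in lanes.items():
        for length in lengths:
            band_to_base[length] = base
    return "".join(band_to_base[length] for length in sorted(band_to_base))


template = "TACGGTCA"
# One lane per dideoxy base, as in the four separate reactions.
lanes = {base: fragment_lengths(template, base) for base in "ACGT"}
read = read_gel(lanes)
# The gel gives the newly made strand; complement it to get the template.
recovered_template = "".join(COMPLEMENT[base] for base in read)
```

In this sketch the dideoxy ATP lane has bands at lengths 1 and 6, which are exactly the positions where the template has a T, and reading all four lanes from shortest to longest band recovers the template.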

So this is how sequencing was done for many, many years. These days, it's been made cheaper and faster. And now what's often used is next generation sequencing. And one pain in the ass about sequencing before is you'd use a lot of radioactivity. Your primer would be radioactive, so that you could detect these bands. Right now, everything's done using fluorescence, which makes it much nicer, I think.

And so in next generation sequencing, your template DNA is attached to a solid substrate, such that it's immobilized on some type of substrate. And then you add each of the four nucleoside triphosphates. In this case, they're labeled with a dye, such that each one is a different color. But the dye also functions to prevent elongation, such that, again, it's this chain termination. When you incorporate one of these, the polymerase just can't run along the DNA. It incorporates one and then stops.

So if you get your first nucleotide incorporated, it will incorporate one of these four. And it will be fluorescent at a certain wavelength, which you can see using a device or microscope. And then what you then do is chemically modify this base, such that you remove the dye and allow it to extend one more base pair. And so you go one nucleotide at a time. And you read out the pattern of fluorescence that appears. And that gives you the sequence of DNA on this molecule that's stuck to your substrate.

And you can do this in parallel. You can have tons, many different strands of DNA. And you can be reading out the sequence of each one of these strands in parallel.
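The cycle of incorporate, image, remove the dye, and repeat can be sketched as follows. This is a toy model, not a description of any particular instrument: the color assigned to each base, the example templates, and the function names are all assumptions made for illustration.

```python
# Toy model of sequencing by synthesis with reversible dye terminators.
# The color assignments and names are illustrative assumptions.

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}


def sequence_by_synthesis(templates, cycles):
    """Return, per immobilized template, the colors observed in each cycle."""
    observed = [[] for _ in templates]
    for cycle in range(cycles):
        # Every strand on the substrate is extended and imaged in parallel.
        for index, template in enumerate(templates):
            if cycle < len(template):
                # Exactly one labeled nucleotide is incorporated; the dye
                # blocks further extension until it is chemically removed.
                incorporated = COMPLEMENT[template[cycle]]
                observed[index].append(DYE_COLOR[incorporated])
    return observed


def decode(colors):
    """Translate a series of colors back into the incorporated bases."""
    return "".join(COLOR_TO_BASE[color] for color in colors)


templates = ["TACG", "GGAT", "CATA"]
signals = sequence_by_synthesis(templates, cycles=4)
reads = [decode(colors) for colors in signals]
```

Each read is the complement of its template, built up one base per cycle, and all three strands are read during the same four cycles.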

Great. Any questions about DNA sequencing? OK. Very good. I will see you on Monday. Have a great weekend.

Big data in modern biology

There is now no question that genomics, the study of the genomes of organisms and a field that includes intensive efforts to determine the entire DNA sequence of organisms, has joined the big data club. The development of prolific new DNA sequencing technologies is forcing biologists to embrace the dizzying terms of terabytes, petabytes and, looming on the horizon, exabytes.

The exabyte (derived from the SI prefix exa-) is a unit of information or computer storage equal to one quintillion bytes (short scale). The unit symbol for the exabyte is EB. —from Wikipedia

The resulting cultural shift has led to a wave of immigrants into the land of pipettes and PCR tubes: the computer scientists, physicists and mathematicians, bringing their exotic and diverse analytical expertise with them. What they also brought was a tradition of adherence to the open source paradigm for software, recently adopted in the genomics world through the preconditions of funding agencies and scientific journals that data and software be shared freely.

The success of initiatives like R/Bioconductor as a resource involving peer review and publication of software has provided a great incentive for young researchers in genomics to develop creative new solutions, the kind of productive hacking that needs to be encouraged in any big data field.

We currently seem to be undergoing one of those episodes of punctuated equilibrium in the evolution of genomics software. The shift is occurring from the prior focus on individual packages addressing specific analytical tasks to more encompassing systems that combine individual packages as workflows, while managing data and processing resources and capturing metadata.

While many separate initiatives are addressing components, the field still lacks the ideal, complete system, and as a result we have not been able to foster the irreverent and innovative approaches to the integrated system that a clandestine, off-the-books skunkworks can nurture. In short, there is a desperate, essential need for an informatics ecosystem in genomics. The requirements for this complete system (or ecosystem) quickly enter the realm of grandiosity. Not only would it be ideal for this system to address genomics challenges, it must also ensure interoperability with other areas of digital biology such as proteomics, metabolomics and imaging.

We also have to face the issue that as genomics sequencing moves from the large centers and core facilities into the hands of individual researchers, the heterogeneity of hardware resources serving these analyses becomes extreme. Can an open source project address all of these sprawling requirements?

The Wasp System software project is an audacious attempt to create a foundational software ecosystem for modern molecular biology. The design is conceptually simple, based on a kernel which mediates between diverse user, processing, and data resources, and a diversity of plugin components interfacing via a common API. Engineered in Spring Framework, the Wasp System’s architecture both modularizes and abstracts many of the operational functionalities unique to Spring, to the benefit of each functional component of a given genomics workflow, or indeed any workflow, making it so modular that it could address all genomics needs and then extend to the other areas of big data generation in biology.

Leading the project at the Albert Einstein College of Medicine in New York is Andy McLellan, who is unusual for combining a University of Cambridge doctorate in Molecular Biochemistry with a subsequent Masters degree in software development and design. He made the strategic decision to develop the Wasp System software using the Spring framework for Java, with open source development in mind. The development team also includes the head of Einstein’s Computational Genomics core facility, Brent Calder, who has migrated genomic software tools into the original Wasp environment to automate analyses for molecular biology colleagues. Their prototypic, Perl-based system focused on genomics has been functional for almost 4 years, managing and processing almost a petabyte of sequencing data over this time and creating a foundation of experience that few others can match in this young field.

Now that Wasp is rewritten in Spring/Java, the goal of the developers is to give it away to as many people as possible. Initial test partners have included the Memorial Sloan-Kettering Cancer Center in New York, the University of California San Diego, and the Australian Genome Research Facility. Once 'out in the wild,' the open source approach relies at least partially on volunteers to act as curators, but the maintenance plan for the Wasp System is more overtly structured, what the Einstein group calls a nurtured open source model. Component plugins are developed by community users to suit local institutional needs, but are tested by the Einstein group for forward compatibility with the latest Wasp software versions, available on Github.

In return for making the plugin forward compatible with Wasp, the participating developer makes their component available to the entire Wasp user community, a model designed to expand the capabilities of the system as a whole quickly. Wasp also addresses the challenge of increasingly diverse hardware resources available to process the data generated, again taking advantage of the fundamental design envisaged by Andy McLellan and Brent Calder. The processing scheduling component of Wasp is essentially agnostic as regards hardware implementation, and as such, anticipates future trends towards cloud or grid-based computing. By taking advantage of Spring’s capabilities, it is possible to build the basis for a distributed peer network, where instances share and provision data intuitively and respond to data generation by launching appropriate analytical pipelines on HPC resources, while reacting appropriately to computational and data-related errors. In this regard, the Wasp System encompasses a real-time messaging system, a responsive workflow and pipeline system, and an interface to HPC assets, designed around best practices for genomics analysis and the challenges of big data in the wider digitized life sciences.

With a powerful and flexible foundational software system like Wasp in place, what should emerge as a consequence is a sandbox for genomics hackers, where pipeline components are juggled, and innovative visualization tools are developed and implemented. A major emphasis for the Einstein group is the encouragement of Wasp’s use to host this kind of skunkworks.

The goal is to use Wasp to nurture the irreverent disregard for convention not currently possible using systems that are more rigid or of more limited scope. In this way we can hope to overcome big data challenges and flourish in the current era of genomics discovery unlocked by DNA sequencing technologies of unprecedented power.

14.2 DNA Structure and Sequencing

The building blocks of DNA are nucleotides. The important components of the nucleotide are a nitrogenous base, deoxyribose (5-carbon sugar), and a phosphate group (Figure 14.5). The nucleotide is named depending on the nitrogenous base. The nitrogenous base can be a purine such as adenine (A) and guanine (G), or a pyrimidine such as cytosine (C) and thymine (T).

The nucleotides combine with each other by covalent bonds known as phosphodiester bonds or linkages. The purines have a double ring structure with a six-membered ring fused to a five-membered ring. Pyrimidines are smaller in size; they have a single six-membered ring structure. The carbon atoms of the five-carbon sugar are numbered 1', 2', 3', 4', and 5' (1' is read as “one prime”). The phosphate residue is attached to the hydroxyl group of the 5' carbon of one sugar of one nucleotide and the hydroxyl group of the 3' carbon of the sugar of the next nucleotide, thereby forming a 5'-3' phosphodiester bond.

In the 1950s, Francis Crick and James Watson worked together to determine the structure of DNA at the University of Cambridge, England. Other scientists like Linus Pauling and Maurice Wilkins were also actively exploring this field. Pauling had discovered the secondary structure of proteins using X-ray crystallography. In Wilkins’ lab, researcher Rosalind Franklin was using X-ray diffraction methods to understand the structure of DNA. Watson and Crick were able to piece together the puzzle of the DNA molecule on the basis of Franklin's data because Crick had also studied X-ray diffraction (Figure 14.6). In 1962, James Watson, Francis Crick, and Maurice Wilkins were awarded the Nobel Prize in Medicine. Unfortunately, by then Franklin had died, and Nobel prizes are not awarded posthumously.

Watson and Crick proposed that DNA is made up of two strands that are twisted around each other to form a right-handed helix. Base pairing takes place between a purine and a pyrimidine; namely, A pairs with T and G pairs with C. Adenine and thymine are complementary base pairs, and cytosine and guanine are also complementary base pairs. The base pairs are stabilized by hydrogen bonds: adenine and thymine form two hydrogen bonds, and cytosine and guanine form three hydrogen bonds. The two strands are anti-parallel in nature; that is, the 3' end of one strand faces the 5' end of the other strand. The sugar and phosphate of the nucleotides form the backbone of the structure, whereas the nitrogenous bases are stacked inside. Each base pair is separated from the other base pair by a distance of 0.34 nm, and each turn of the helix measures 3.4 nm. Therefore, ten base pairs are present per turn of the helix. The diameter of the DNA double helix is 2 nm, and it is uniform throughout. Only the pairing between a purine and pyrimidine can explain the uniform diameter. The twisting of the two strands around each other results in the formation of uniformly spaced major and minor grooves (Figure 14.7).
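The pairing rules and helix dimensions in this paragraph lend themselves to a short worked example. The sketch below uses only the numbers given in the text (0.34 nm per base pair, ten base pairs per turn, two or three hydrogen bonds per pair); the example strand and the function names are made up for illustration.

```python
# Worked example of the base-pairing rules and helix dimensions.
# The constants come from the text; the names are illustrative.

RISE_PER_BASE_PAIR_NM = 0.34   # distance between adjacent base pairs
BASE_PAIRS_PER_TURN = 10       # 3.4 nm per turn / 0.34 nm per base pair
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand):
    """Partner strand written 5' to 3', since the strands are anti-parallel."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bond_count(strand):
    """Total hydrogen bonds holding the strand to its partner."""
    return sum(HYDROGEN_BONDS[base] for base in strand)


def helix_length_nm(base_pairs):
    """Length of a relaxed double helix with the given number of base pairs."""
    return base_pairs * RISE_PER_BASE_PAIR_NM


def helix_turns(base_pairs):
    """Number of full helical turns in the given number of base pairs."""
    return base_pairs / BASE_PAIRS_PER_TURN


strand = "ATGCGT"
partner = reverse_complement(strand)      # "ACGCAT"
bonds = hydrogen_bond_count(strand)       # 2 + 2 + 3 + 3 + 3 + 2 = 15
length = helix_length_nm(1000)            # about 340 nm
turns = helix_turns(1000)                 # 100 turns
```

A 1000 base pair fragment is therefore about 340 nm long and makes 100 turns, which follows directly from the 0.34 nm spacing.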

DNA Sequencing Techniques

Until the 1990s, the sequencing of DNA (reading the sequence of DNA) was a relatively expensive and long process. Using radiolabeled nucleotides also compounded the problem through safety concerns. With currently available technology and automated machines, the process is cheaper, safer, and can be completed in a matter of hours. Fred Sanger developed the sequencing method used for the human genome sequencing project, which is widely used today (Figure 14.8).

Link to Learning

Visit this site to watch a video explaining the DNA sequence reading technique that resulted from Sanger’s work.

The method is known as the dideoxy chain termination method. The sequencing method is based on the use of chain terminators, the dideoxynucleotides (ddNTPs). The dideoxynucleotides, or ddNTPs, differ from the deoxynucleotides by the lack of a free 3' OH group on the five-carbon sugar. If a ddNTP is added to a growing DNA strand, the chain is not extended any further because the free 3' OH group needed to add another nucleotide is not available. By using a predetermined ratio of deoxyribonucleotides to dideoxynucleotides, it is possible to generate DNA fragments of different sizes.

The DNA sample to be sequenced is denatured or separated into two strands by heating it to high temperatures. The DNA is divided into four tubes in which a primer, DNA polymerase, and all four nucleotides (A, T, G, and C) are added. In addition, limited quantities of one of the four dideoxynucleotides are added to each tube. The tubes are labeled as A, T, G, and C according to the ddNTP added. For detection purposes, each of the four dideoxynucleotides carries a different fluorescent label. Chain elongation continues until a fluorescent dideoxy nucleotide is incorporated, after which no further elongation takes place. After the reaction is over, electrophoresis is performed. Even a difference in length of a single base can be detected. The sequence is read from a laser scanner. For his work on DNA sequencing, Sanger received a Nobel Prize in chemistry in 1980.
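Why the ratio of deoxynucleotides to dideoxynucleotides matters can be seen in a small simulation. The sketch below is a simplification under stated assumptions: each site where the spiked base can be added has the same independent chance `dd_fraction` of receiving the dideoxy form, and the site positions, fraction, and function names are invented for illustration. Under that assumption the chance of stopping at the k-th site is `dd_fraction * (1 - dd_fraction) ** (k - 1)`.

```python
import random


def terminated_length(positions, dd_fraction, rng):
    """Walk along the sites where the spiked base can be incorporated.

    positions: fragment lengths at which the base occurs, in order.
    dd_fraction: assumed chance that the dideoxy form is incorporated.
    Returns the length at which the chain stops, or None if it never stops.
    """
    for position in positions:
        if rng.random() < dd_fraction:
            return position
    return None


def band_counts(positions, dd_fraction, molecules, seed=0):
    """Count how many simulated molecules terminate at each position."""
    rng = random.Random(seed)
    counts = {position: 0 for position in positions}
    for _ in range(molecules):
        stop = terminated_length(positions, dd_fraction, rng)
        if stop is not None:
            counts[stop] += 1
    return counts


# A small dideoxy fraction spreads terminations over every site.
counts = band_counts([3, 7, 12, 20], dd_fraction=0.1, molecules=10_000)
# With only dideoxynucleotides, every chain stops at the first site.
all_dideoxy = band_counts([3, 7], dd_fraction=1.0, molecules=100)
```

With too much dideoxynucleotide nearly all chains stop at the first site and the longer fragments never appear, which is why only a limited quantity is added.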

Link to Learning

Sanger’s genome sequencing has led to a race to sequence human genomes at a rapid speed and low cost, often referred to as the $1000 in one day sequence. Learn more by selecting the Sequencing at Speed animation here.

Gel electrophoresis is a technique used to separate DNA fragments of different sizes. Usually the gel is made of a chemical called agarose. Agarose powder is added to a buffer and heated. After cooling, the gel solution is poured into a casting tray. Once the gel has solidified, the DNA is loaded on the gel and electric current is applied. The DNA has a net negative charge and moves from the negative electrode toward the positive electrode. The electric current is applied for sufficient time to let the DNA separate according to size; the smallest fragments will be farthest from the well (where the DNA was loaded), and the heavier molecular weight fragments will be closest to the well. Once the DNA is separated, the gel is stained with a DNA-specific dye for viewing it (Figure 14.9).

Evolution Connection

Neanderthal Genome: How Are We Related?

The first draft sequence of the Neanderthal genome was recently published by Richard E. Green et al. in 2010. [1] Neanderthals are the closest ancestors of present-day humans. They were known to have lived in Europe and Western Asia before they disappeared from fossil records approximately 30,000 years ago. Green’s team studied almost 40,000-year-old fossil remains that were selected from sites across the world. Extremely sophisticated means of sample preparation and DNA sequencing were employed because of the fragile nature of the bones and heavy microbial contamination. In their study, the scientists were able to sequence some four billion base pairs. The Neanderthal sequence was compared with that of present-day humans from across the world. After comparing the sequences, the researchers found that the Neanderthal genome had 2 to 3 percent greater similarity to people living outside Africa than to people in Africa. While current theories have suggested that all present-day humans can be traced to a small ancestral population in Africa, the data from the Neanderthal genome may contradict this view. Green and his colleagues also discovered DNA segments among people in Europe and Asia that are more similar to Neanderthal sequences than to other contemporary human sequences. Another interesting observation was that Neanderthals are as closely related to people from Papua New Guinea as to those from China or France. This is surprising because Neanderthal fossil remains have been located only in Europe and West Asia. Most likely, genetic exchange took place between Neanderthals and modern humans as modern humans emerged out of Africa, before the divergence of Europeans, East Asians, and Papua New Guineans.

Several genes seem to have undergone changes from Neanderthals during the evolution of present-day humans. These genes are involved in cranial structure, metabolism, skin morphology, and cognitive development. One of the genes that is of particular interest is RUNX2, which is different in modern day humans and Neanderthals. This gene is responsible for the prominent frontal bone, bell-shaped rib cage, and dental differences seen in Neanderthals. It is speculated that an evolutionary change in RUNX2 was important in the origin of modern-day humans, and this affected the cranium and the upper body.

Link to Learning

Watch Svante Pääbo’s talk explaining the Neanderthal genome research at the 2011 annual TED (Technology, Entertainment, Design) conference.

DNA Packaging in Cells

When comparing prokaryotic cells to eukaryotic cells, prokaryotes are much simpler than eukaryotes in many of their features (Figure 14.10). Most prokaryotes contain a single, circular chromosome that is found in an area of the cytoplasm called the nucleoid.

Visual Connection

In eukaryotic cells, DNA and RNA synthesis occur in a separate compartment from protein synthesis. In prokaryotic cells, both processes occur together. What advantages might there be to separating the processes? What advantages might there be to having them occur together?

The size of the genome in one of the most well-studied prokaryotes, E. coli, is 4.6 million base pairs (approximately 1.6 mm at 0.34 nm per base pair, if cut and stretched out). So how does this fit inside a small bacterial cell? The DNA is twisted by what is known as supercoiling. Supercoiling means that DNA is either under-wound (less than one turn of the helix per 10 base pairs) or over-wound (more than 1 turn per 10 base pairs) from its normal relaxed state. Some proteins are known to be involved in the supercoiling; other proteins and enzymes such as DNA gyrase help in maintaining the supercoiled structure.

Eukaryotes, whose chromosomes each consist of a linear DNA molecule, employ a different type of packing strategy to fit their DNA inside the nucleus (Figure 14.11). At the most basic level, DNA is wrapped around proteins known as histones to form structures called nucleosomes. The histones are evolutionarily conserved proteins that are rich in basic amino acids and form an octamer. The DNA (which is negatively charged because of the phosphate groups) is wrapped tightly around the histone core. This nucleosome is linked to the next one with the help of a linker DNA. This is also known as the “beads on a string” structure. This is further compacted into a 30 nm fiber, which is the diameter of the structure. At the metaphase stage, the chromosomes are at their most compact, are approximately 700 nm in width, and are found in association with scaffold proteins.

In interphase, eukaryotic chromosomes have two distinct regions that can be distinguished by staining. The tightly packaged region is known as heterochromatin, and the less dense region is known as euchromatin. Heterochromatin usually contains genes that are not expressed, and is found in the regions of the centromere and telomeres. The euchromatin usually contains genes that are transcribed, with DNA packaged around nucleosomes but not further compacted.



Want to cite, share, or modify this book? This book is licensed under a Creative Commons Attribution License 4.0 and you must attribute OpenStax.

  • Authors: Connie Rye, Robert Wise, Vladimir Jurukovski, Jean DeSaix, Jung Choi, Yael Avissar
  • Publisher/website: OpenStax
  • Book title: Biology
  • Publication date: Oct 21, 2016
  • Location: Houston, Texas

© Sep 15, 2020 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License 4.0. The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.


    In this work, we analyze more than 4000 WGS samples from 14 different pathogenic bacterial species to evaluate the extent and impact of contamination in bacterial WGS studies. We show that presence of sequencing reads from contaminating organisms is frequent, even when sequencing is performed from pure culture isolates (Fig. 1). Beyond inappropriate laboratory practices, there are several potential sources of contamination which depend on different factors such as the type of sample processed and its origin, or the protocols followed for culture, DNA extraction, and sequencing. For instance, Salter et al. demonstrated that contaminating DNA in laboratory reagents can critically impact microbiome analysis from low-biomass samples [19]. Culture-free sequencing approaches for unculturable or slow-growing pathogens, such as T. pallidum or MTB, entail the presence of high amounts of contaminating DNA from the host organism. Other sources unrelated to sample handling are also possible. For example, the S. aureus samples supposed to be MTB from the Nigeria study are most likely an error during data submission to the genomic repository. Regardless of the source of contamination, the shared consequence is the presence of non-target reads in the sequencing files that might impact the results of genomic analysis.

    We evaluated such an impact and demonstrate that contaminant reads pose a pitfall in re-sequencing pipelines, since they are unexpectedly frequent and can have major implications in variant analysis, which is the foundation of many genomic analyses. As expected, contamination is a major issue when sequencing DNA that has not been extracted from pure cultures or single colonies, as is often the case for clinical specimens. However, we show that experiments sequencing from pure cultures are not necessarily free of contamination, and that using standard mapping quality parameters is not enough to deal with contaminant reads. Therefore, bioinformatic pipelines assuming that all the reads successfully mapped are from the target organism might lead to a biased variant analysis. We show that the errors introduced by contamination are very variable among different studies (Table 2, Fig. 3, Fig. 5), which differ not only in the organism being sequenced but also in the sampling source and laboratory protocols. For example, in the T. pallidum study, where samples are heavily contaminated, very few differences are observed in the variant analysis. This stems from the fact that most of the contamination in this study comes from human reads, unlikely to map to the T. pallidum genome. On the contrary, for the L. pneumophila dataset, a sample with 96.27% of Legionella had 79 vSNPs and 5 fSNPs removed, and 17 fSNPs recovered, after filtering out 3% of unclassified reads. According to NCBI BLAST, a fraction of those reads was from Legionella spiritensis. The downstream relevance, however, is not directly proportional to the absolute number of erroneous SNPs and frequencies, but to what those errors mean for each organism. For example, for organisms with low genetic diversity, as in the case of MTB, a change in a few fSNPs can have major implications in epidemiology studies, since transmission cutoffs vary between 5 and 12 fSNPs [20]. This is also true when predicting drug resistance, particularly considering that many drug resistance-associated genes are conserved among bacteria and hence more prone to recruit contaminant mappings. Likewise, the higher impact observed for vSNPs, both in terms of absolute numbers and frequencies, can have large implications in those applications based on the analysis of the allele frequency spectrum, for example, when studying complex traits in bacterial populations. For instance, vSNPs are analyzed to determine heteroresistance to antibiotics [21], within-host diversity of pathogens [22], host adaptation of bacteria [23], and even to delineate between-patient transmission of pathogens [24]. While not specifically tested in our analysis, our results also have obvious implications in other applications which highly depend on the variation detected (e.g., cgMLST typing) or when contaminant reads are incorporated in de novo assemblies [18].
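To make the point about transmission cutoffs concrete, the following sketch shows how a handful of spurious fSNPs can change a transmission call. It is an illustration only and not the pipeline used in this work: the SNP positions and alleles are invented, the function names are hypothetical, and treating the 5 and 12 fSNP values from the text as inclusive thresholds is an assumption.

```python
# Illustration of how contaminant-derived fSNPs can flip a transmission call.
# Data, names, and the inclusive use of the cutoffs are assumptions.


def fixed_snp_distance(sample_a, sample_b):
    """Count genome positions where two samples differ in their fixed SNPs.

    Each sample is a dict mapping genome position -> alternate allele.
    A position present in only one sample counts as a difference.
    """
    positions = set(sample_a) | set(sample_b)
    return sum(1 for p in positions if sample_a.get(p) != sample_b.get(p))


def transmission_call(distance, strict_cutoff=5, loose_cutoff=12):
    """Classify a pair of samples against the 5 and 12 fSNP cutoffs."""
    if distance <= strict_cutoff:
        return "within both cutoffs"
    if distance <= loose_cutoff:
        return "within the looser cutoff only"
    return "outside both cutoffs"


patient_1 = {100: "A", 250: "T", 900: "G"}
patient_2 = {100: "A", 250: "T", 1200: "C"}

# The same second sample with four spurious fSNPs from contaminant reads.
patient_2_contaminated = dict(patient_2)
patient_2_contaminated.update({3000: "G", 3100: "A", 3200: "T", 3300: "C"})

clean_distance = fixed_snp_distance(patient_1, patient_2)
contaminated_distance = fixed_snp_distance(patient_1, patient_2_contaminated)
clean_call = transmission_call(clean_distance)
contaminated_call = transmission_call(contaminated_distance)
```

In this made-up pair, four spurious fSNPs raise the distance from 2 to 6, enough to move the pair across the stricter cutoff.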

    The main limitation of our study is that we have based our estimations on the taxonomic classifications of Kraken. However, taxonomic classifiers are known to misclassify a proportion of reads that are thus incorrectly identified as contaminants. We took into account several considerations to control for the potential biases in our analysis. First, whereas Kraken is computationally expensive, its performance has been demonstrated in several studies to rank among the best to date [25,26,27]. Secondly, since distinguishing between closely related species may be difficult, to be conservative, we performed the taxonomic filter at the genus level instead of species. Additionally, we estimated the error introduced by Kraken in our own setting, particularly regarding unclassified and misclassified reads, and showed that the error rate was very low (Additional file 1: Table S1; Additional file 2: Table S2; Additional file 3: Table S3), in agreement with published data. Despite these measures, we might have under- or overestimated the impact of contaminant reads in some cases. For example, by removing non-target reads at the level of genus, we might have underestimated the impact of potential contamination, given that contaminant reads from the same genus (but different species) are more likely to map to the reference genome and thus impact variant analysis. Our analysis also showed that Kraken might have overestimated the number of contaminant reads in some datasets due to, for example, exchange of genetic material between species (K. pneumoniae, S. aureus, S. enterica) or the absence of enough genetic diversity in the database.
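The idea of a genus-level taxonomic filter can be sketched as below. This is a minimal illustration, not the authors' implementation, and it does not parse real Kraken output: it assumes each read has already been reduced to a hypothetical `(read_id, species_name)` pair, with `None` standing for an unclassified read, and all names in the code are made up.

```python
# Minimal sketch of a genus-level taxonomic filter.
# Input format, read ids, and function names are assumptions.


def genus_of(taxon_name):
    """First word of a binomial species name, used here as the genus."""
    return taxon_name.split()[0] if taxon_name else None


def filter_reads(classified_reads, target_genus, keep_unclassified=False):
    """Keep the ids of reads assigned to the target genus.

    classified_reads: iterable of (read_id, species_name or None).
    keep_unclassified: whether reads with no classification are retained.
    """
    kept = []
    for read_id, taxon in classified_reads:
        genus = genus_of(taxon)
        if genus == target_genus or (genus is None and keep_unclassified):
            kept.append(read_id)
    return kept


reads = [
    ("r1", "Mycobacterium tuberculosis"),
    ("r2", "Mycobacterium avium"),
    ("r3", "Staphylococcus aureus"),
    ("r4", None),
    ("r5", "Homo sapiens"),
]

target_only = filter_reads(reads, "Mycobacterium")
with_unclassified = filter_reads(reads, "Mycobacterium", keep_unclassified=True)
```

Read `r2` survives the filter even though it comes from a different species, which is the caveat raised above: filtering at the genus level cannot remove contaminants from the same genus.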

    Altogether our results show that contaminant reads in re-sequencing experiments are frequent and can greatly bias variant analysis at a genomic level. However, based on our results, it is clear that different settings will require different contamination-control strategies that take into account the genetic particularities of each organism. Whereas the taxonomic filter we propose seems to perform well in many situations, in the case of highly diverse bacteria, other approaches might be better suited. For instance, coverage information and k-mer frequencies [28, 29] can be used to distinguish between target and contaminant reads when these are present in significantly different proportions. Similarly, detecting cross-contamination with strains of the same species is challenging and requires specific strategies. These strategies can include detection of vSNPs at lineage-defining positions, calculation of biased allele ratios [30], or Bayesian statistical modeling [31].

    Importantly, different implementations of such strategies should be extensively evaluated and validated. Here, we provide such an evaluation for the pathogen our laboratory is focused on: M. tuberculosis. In addition to the taxonomic filter, we evaluated a second contaminant filtering approach based on the similarity of the read alignments. In this case, the Kraken-based taxonomic filter clearly outperformed the similarity filter, which is probably true for other organisms with representative genomes in the databases and moderate genetic diversities (Fig. 4, Fig. 5, Additional file 1: Tables S8 and S9, and Figure S1).

    The analyses for MTB reveal a large number of variants introduced by contaminants with downstream consequences when calling vSNPs and fSNPs as well as the wild type. Remarkably, we show that contamination can introduce substantial errors in samples that could be considered “pure” or with high sequencing depths, implying that contamination-aware pipelines will be needed in any circumstance.

    Contamination has been recognized as a major source of error in genome assemblies and other fields like metagenomics [16, 19]. However, the role of contamination in re-sequencing pipelines is usually neglected. Whereas some groups are already aware of this issue, most bacterial re-sequencing pipelines still lack contamination-control strategies or, where they exist, these are rarely detailed in published works. Based on our findings, we call for the inclusion of contamination control as a basic quality parameter and the use of validated contamination-aware pipelines in any bacterial WGS study. These analysis pipelines should be capable of, at least, reporting the contaminated samples and their contaminants to be later interpreted by the researcher. Ideally, they should be able to produce accurate results regardless of the extent of contamination of a sample. Pipelines capable of accurately analyzing contaminated WGS data will soon become essential, since the improvement of laboratory protocols allows the sequencing of an increasing number of bacterial species directly from clinical specimens [32, 33]. In this work, we provide a highly accurate contamination-aware pipeline for MTB WGS analysis that will be extremely helpful in upcoming studies and clinical applications that sequence MTB directly from respiratory samples.

    Neanderthal Genome: How Are We Related?

    The first draft sequence of the Neanderthal genome was published by Richard E. Green et al. in 2010. [1] Neanderthals are the closest extinct relatives of present-day humans. They were known to have lived in Europe and Western Asia before they disappeared from fossil records approximately 30,000 years ago. Green's team studied almost 40,000-year-old fossil remains that were selected from sites across the world. Extremely sophisticated means of sample preparation and DNA sequencing were employed because of the fragile nature of the bones and heavy microbial contamination. In their study, the scientists were able to sequence some four billion base pairs. The Neanderthal sequence was compared with that of present-day humans from across the world. After comparing the sequences, the researchers found that the Neanderthal genome had 2 to 3 percent greater similarity to people living outside Africa than to people in Africa. While current theories have suggested that all present-day humans can be traced to a small ancestral population in Africa, the data from the Neanderthal genome may contradict this view. Green and his colleagues also discovered DNA segments among people in Europe and Asia that are more similar to Neanderthal sequences than to other contemporary human sequences. Another interesting observation was that Neanderthals are as closely related to people from Papua New Guinea as to those from China or France. This is surprising because Neanderthal fossil remains have been located only in Europe and West Asia. Most likely, genetic exchange took place between Neanderthals and modern humans as modern humans emerged out of Africa, before the divergence of Europeans, East Asians, and Papua New Guineans.

    Several genes seem to have undergone changes from Neanderthals during the evolution of present-day humans. These genes are involved in cranial structure, metabolism, skin morphology, and cognitive development. One of the genes that is of particular interest is RUNX2, which is different in modern day humans and Neanderthals. This gene is responsible for the prominent frontal bone, bell-shaped rib cage, and dental differences seen in Neanderthals. It is speculated that an evolutionary change in RUNX2 was important in the origin of modern-day humans, and this affected the cranium and the upper body.

    Watch Svante Pääbo's talk explaining the Neanderthal genome research at the 2011 annual TED (Technology, Entertainment, Design) conference.

    17.3 Whole-Genome Sequencing

    Although there have been significant advances in the medical sciences in recent years, doctors are still confounded by some diseases, and they are using whole-genome sequencing to get to the bottom of the problem. Whole-genome sequencing is a process that determines the DNA sequence of an entire genome. Whole-genome sequencing is a brute-force approach to problem solving when there is a genetic basis at the core of a disease. Several laboratories now provide services to sequence, analyze, and interpret entire genomes.

    For example, whole-exome sequencing is a lower-cost alternative to whole genome sequencing. In exome sequencing, only the coding, exon-producing regions of the DNA are sequenced. In 2010, whole-exome sequencing was used to save a young boy whose intestines had multiple mysterious abscesses. The child had several colon operations with no relief. Finally, whole-exome sequencing was performed, which revealed a defect in a pathway that controls apoptosis (programmed cell death). A bone-marrow transplant was used to overcome this genetic disorder, leading to a cure for the boy. He was the first person to be successfully treated based on a diagnosis made by whole-exome sequencing. Today, human genome sequencing is more readily available and can be completed in a day or two for about $1000.

    Strategies Used in Sequencing Projects

    The basic sequencing technique used in all modern day sequencing projects is the chain termination method (also known as the dideoxy method), which was developed by Fred Sanger in the 1970s. The chain termination method involves DNA replication of a single-stranded template with the use of a primer and a regular deoxynucleotide (dNTP), which is a monomer, or a single unit, of DNA. The primer and dNTP are mixed with a small proportion of fluorescently labeled dideoxynucleotides (ddNTPs). The ddNTPs are monomers that are missing a hydroxyl group (–OH) at the site at which another nucleotide usually attaches to form a chain (Figure 17.13). Each ddNTP is labeled with a different color of fluorophore. Every time a ddNTP is incorporated in the growing complementary strand, it terminates the process of DNA replication, which results in multiple short strands of replicated DNA that are each terminated at a different point during replication. When the reaction mixture is processed by gel electrophoresis after being separated into single strands, the multiple newly replicated DNA strands form a ladder because of the differing sizes. Because the ddNTPs are fluorescently labeled, each band on the gel reflects the size of the DNA strand and the ddNTP that terminated the reaction. The different colors of the fluorophore-labeled ddNTPs help identify the ddNTP incorporated at that position. Reading the gel on the basis of the color of each band on the ladder produces the sequence of the template strand (Figure 17.14).
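
    The logic of reading a chain-termination ladder can be sketched in a few lines. This is a toy simulation under stated assumptions, not a model of real chemistry: the function names, the 10% ddNTP fraction, and the number of template copies are hypothetical, and the template is written in the order the polymerase reads it. Each simulated strand stops at the first position where a labeled ddNTP happens to be incorporated, and sorting the fragments by length recovers the order of the labels.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def synthesize_fragments(template, copies=2000, dd_fraction=0.1, seed=0):
    """Simulate terminated strands as (length, labeled_base) pairs.

    At every position the incoming nucleotide is a ddNTP with
    probability dd_fraction; if so, extension stops there and the
    fragment carries the label of that base. Strands that reach the
    end without a ddNTP are unlabeled and therefore not recorded.
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(copies):
        for position, template_base in enumerate(template, start=1):
            if rng.random() < dd_fraction:
                fragments.append((position, COMPLEMENT[template_base]))
                break
    return fragments

def read_ladder(fragments):
    """Read labels from the shortest fragment to the longest."""
    label_by_length = {}
    for length, label in fragments:
        label_by_length[length] = label
    return "".join(label_by_length[n] for n in sorted(label_by_length))

template = "TACGGATCCATG"
new_strand = read_ladder(synthesize_fragments(template))
print(new_strand)
```

    The ladder spells out the newly synthesized strand, which is complementary to the template; the template sequence follows by complementing each base. The low ddNTP fraction matters: if it were too high, almost every strand would terminate early and the long fragments needed to read distant positions would be missing.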

    Early Strategies: Shotgun Sequencing and Pair-Wise End Sequencing

    In the shotgun sequencing method, several copies of a DNA fragment are cut randomly into many smaller pieces (somewhat like what happens to a round shot cartridge when fired from a shotgun). All of the segments are then sequenced using the chain termination method. Then, with the help of a computer, the fragments are analyzed to see where their sequences overlap. By matching up overlapping sequences at the end of each fragment, the entire DNA sequence can be reformed. A larger sequence that is assembled from overlapping shorter sequences is called a contig. As an analogy, consider that someone has four copies of a landscape photograph that you have never seen before and know nothing about how it should appear. The person then rips up each photograph with their hands, so that different size pieces are present from each copy. The person then mixes all of the pieces together and asks you to reconstruct the photograph. In one of the smaller pieces you see a mountain. In a larger piece, you see that the same mountain is behind a lake. A third fragment shows only the lake, but it reveals that there is a cabin on the shore of the lake. Therefore, from looking at the overlapping information in these three fragments, you know that the picture contains a mountain behind a lake that has a cabin on its shore. This is the principle behind reconstructing entire DNA sequences using shotgun sequencing.
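
    The overlap-matching step can be illustrated with a minimal greedy sketch. It is an assumption-laden toy, not how production assemblers work: the function names and the `min_len` threshold are hypothetical, reads are assumed error-free and from one strand, and repeats are ignored. It repeatedly merges the pair of fragments with the longest suffix-to-prefix overlap until no overlaps remain.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments greedily by longest overlap; return the contigs."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments that are wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: several contigs are left
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags

pieces = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(pieces))
```

    In the photograph analogy, `overlap` is the check that two torn pieces show the same bit of mountain, and each merge glues two pieces into a larger one. When the loop ends with more than one fragment, each remaining string is a separate contig.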

    Originally, shotgun sequencing only analyzed one end of each fragment for overlaps. This was sufficient for sequencing small genomes. However, the desire to sequence larger genomes, such as that of a human, led to the development of double-barrel shotgun sequencing, more formally known as pairwise-end sequencing. In pairwise-end sequencing, both ends of each fragment are analyzed for overlap. Pairwise-end sequencing is, therefore, more cumbersome than shotgun sequencing, but it is easier to reconstruct the sequence because there is more available information.

    Next-generation Sequencing

    Since 2005, the automated sequencing techniques used by laboratories have fallen under the umbrella of next-generation sequencing, which is a group of automated techniques used for rapid DNA sequencing. These automated low-cost sequencers can generate sequences of hundreds of thousands or millions of short fragments (25 to 500 base pairs) in the span of one day. These sequencers use sophisticated software to get through the cumbersome process of putting all the fragments in order.

    Evolution Connection

    Comparing Sequences

    A sequence alignment is an arrangement of protein, DNA, or RNA sequences; it is used to identify regions of similarity between cell types or species, which may indicate conservation of function or structure. Sequence alignments may be used to construct phylogenetic trees. The following website uses a software program called BLAST (basic local alignment search tool).

    Under “Basic Blast,” click “Nucleotide Blast.” Input the following sequence into the large "query sequence" box: ATTGCTTCGATTGCA. Below the box, locate the "Species" field and type "human" or "Homo sapiens". Then click “BLAST” to compare the inputted sequence against known sequences of the human genome. The result is that this sequence occurs in over a hundred places in the human genome. Scroll down below the graphic with the horizontal bars and you will see a short description of each of the matching hits. Pick one of the hits near the top of the list and click on "Graphics". This will bring you to a page that shows where the sequence is found within the entire human genome. You can move the slider that looks like a green flag back and forth to view the sequences immediately around the selected gene. You can then return to your selected sequence by clicking the "ATG" button.
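
    To make the idea of a local alignment concrete, here is a minimal sketch of the classic Smith-Waterman dynamic-programming method. It is not BLAST itself, which uses faster heuristics; the function name, the scoring values (match +2, mismatch -1, gap -2), and the two toy sequences are assumptions for illustration only.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which is what
    # lets an alignment start and end anywhere (i.e. "local").
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

print(local_alignment("TTGCTTCGAA", "CCGCTTCGTT"))
```

    The two toy sequences differ at both ends but share the stretch GCTTCG, and the local alignment reports only that shared region. This is the sense in which a search tool can find a short query inside a much longer genome sequence.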

    Use of Whole-Genome Sequences of Model Organisms

    The first genome to be completely sequenced was that of a bacterial virus, the bacteriophage φX174 (5386 base pairs); this was accomplished by Fred Sanger using shotgun sequencing. Several other organelle and viral genomes were later sequenced. The first organism whose genome was sequenced was the bacterium Haemophilus influenzae; this was accomplished by Craig Venter in 1995. Approximately 74 different laboratories collaborated on the sequencing of the genome of the yeast Saccharomyces cerevisiae, which began in 1989 and was completed in 1996, because it was 60 times bigger than any other genome that had been sequenced. By 1997, the genome sequences of two important model organisms were available: the bacterium Escherichia coli K12 and the yeast Saccharomyces cerevisiae. Genomes of other model organisms, such as the mouse Mus musculus, the fruit fly Drosophila melanogaster, the nematode Caenorhabditis elegans, and humans Homo sapiens are now known. A lot of basic research is performed in model organisms because the information can be applied to genetically similar organisms. A model organism is a species that is studied as a model to understand the biological processes in other species represented by the model organism. Having entire genomes sequenced helps with the research efforts in these model organisms. The process of attaching biological information to gene sequences is called genome annotation. Annotation of gene sequences helps with basic experiments in molecular biology, such as designing PCR primers and RNA targets.

    Uses of Genome Sequences

    DNA microarrays are methods used to detect gene expression by analyzing an array of DNA fragments that are fixed to a glass slide or a silicon chip to identify active genes and identify sequences. Almost one million genotypic abnormalities can be discovered using microarrays, whereas whole-genome sequencing can provide information about all six billion base pairs in the human genome. Although the study of medical applications of genome sequencing is interesting, this discipline tends to dwell on abnormal gene function. Knowledge of the entire genome will allow future onset diseases and other genetic disorders to be discovered early, which will allow for more informed decisions to be made about lifestyle, medication, and having children. Genomics is still in its infancy, although someday it may become routine to use whole-genome sequencing to screen every newborn to detect genetic abnormalities.

    In addition to disease and medicine, genomics can contribute to the development of novel enzymes that convert biomass to biofuel, which results in higher crop and fuel production, and lower cost to the consumer. This knowledge should allow better methods of control over the microbes that are used in the production of biofuels. Genomics could also improve the methods used to monitor the impact of pollutants on ecosystems and help clean up environmental contaminants. Genomics has allowed for the development of agrochemicals and pharmaceuticals that could benefit medical science and agriculture.

    It sounds great to have all the knowledge we can get from whole-genome sequencing; however, humans have a responsibility to use this knowledge wisely. Otherwise, it could be easy to misuse the power of such knowledge, leading to discrimination based on a person's genetics, human genetic engineering, and other ethical concerns. This information could also lead to legal issues regarding health and privacy.

    Shed skin as a source of DNA for genotyping-by-sequencing (GBS) in reptiles

    Association and genetic mapping studies aimed at linking genotype to phenotype are powerful tools that require large numbers of samples, complicating their use in long-lived species with low fecundity. Shed skins of snakes and other reptiles contain DNA, are a safe and ethical way of non-invasively sampling large numbers of individuals, and provide a simple mechanism by which to involve the public in scientific research. Here we test whether the DNA in dried shed skins mailed to us from citizen scientists is suitable for reduced representation sequencing approaches, specifically genotyping-by-sequencing (GBS). We find that shed skin samples provide DNA of sufficient quality and quantity for GBS, although libraries from shed skin resulted in fewer sequenced reads than libraries from snap-frozen muscle, and contained slightly fewer variants (70,685 SNPs versus 97,724). This issue is a direct result of lower read counts of the shed skin samples, and can be rectified quite simply with deeper sequencing. Skin-derived libraries also have a very slightly (but significantly) different profile of transitions and transversions, suggesting that DNA damage occurs but is minimal. We conclude that shed skin-derived DNA is a good source of genomic DNA for a variety of genetic studies, and use it to identify sex-linked scaffolds in the corn snake genome.
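
    The transition/transversion profile mentioned above rests on a simple classification, sketched here. This is an illustrative helper, not the analysis used in the study; the function names and the example SNP list are hypothetical. A transition swaps a purine for a purine (A and G) or a pyrimidine for a pyrimidine (C and T); every other single-base substitution is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def ts_tv_ratio(snps):
    """Ratio of transitions to transversions in a list of (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snps
             if classify_substitution(ref, alt) == "transition")
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")

# Made-up SNP calls: three transitions and two transversions.
snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
print(ts_tv_ratio(snps))
```

    Comparing this ratio between two library types is one simple way to look for a shift in substitution profile, since some kinds of DNA damage (for example cytosine deamination, which appears as C-to-T changes) inflate one class of substitution.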

    B. Details of DiDeoxy Sequencing

    Given a template DNA (e.g., a plasmid cDNA), Sanger used in vitro replication protocols to demonstrate that he could:

    1. Replicate DNA under conditions that randomly stopped nucleotide addition at every possible position in growing strands.
    2. Separate and then detect these DNA fragments of replicated DNA.

    Recall that DNA polymerases catalyze the formation of phosphodiester bonds by linking the α-phosphate of a nucleotide triphosphate to the free 3′ OH of a deoxynucleotide at the end of a growing DNA strand. Recall also that the deoxyribose sugar in the deoxynucleotide precursors of replication lacks a 2′ OH (hydroxyl) group. Sanger's trick was to add dideoxynucleotide triphosphates to his in vitro replication mix. The sugar on a dideoxynucleotide triphosphate (ddNTP) lacks a 3′ OH, in addition to the 2′ OH group (as shown below).

    Adding a dideoxynucleotide to a growing DNA strand stops replication. No further nucleotides can add to the 3′-end of the replicating DNA strand because the 3′-OH necessary for the dehydration synthesis of the next phosphodiester bond is absent! Because they can stop replication in actively growing cells, chain-terminating nucleoside analogs such as cordycepin (3′-deoxyadenosine) have been studied as anti-cancer chemotherapeutic drugs.

    A look at a manual DNA sequencing protocol reveals what is going on in the sequencing reactions. Four reaction tubes are set up, each containing the template DNA to be sequenced, a primer of known sequence and the four required deoxynucleotide precursors necessary for replication.

    The set-up for manual DNA sequencing is shown below.

    A different ddNTP (ddATP, ddCTP, ddGTP or ddTTP) is added to each of the four tubes. Finally, DNA polymerase is added to each tube to start the DNA synthesis reaction. During DNA synthesis, different length fragments of new DNA accumulate as the ddNTPs incorporate randomly, opposite complementary bases in the template DNA being sequenced. The expectations of the dideoxy sequencing reactions in the four tubes are illustrated below.

    A short time after adding the DNA polymerase to begin the reactions, the mixture is heated to separate the DNA strands and fresh DNA polymerase is added to repeat the synthesis reactions. These sequencing reactions are repeated as many as 30 times in order to produce enough radioactive DNA fragments to be detected. When the heat-stable Taq DNA polymerase from the thermophilic bacterium Thermus aquaticus became available (more later!), it was no longer necessary to add fresh DNA polymerase after each replication cycle. The many heating and cooling cycles required for what became known as chain-termination DNA sequencing were soon automated using inexpensive programmable thermocyclers.

    Since a small amount of a radioactive deoxynucleotide (usually 32P-labeled dATP) was present in each reaction tube, the newly made DNA fragments are radioactive. After electrophoresis to separate the new DNA fragments in each tube, autoradiography of the electrophoretic gel reveals the position of each terminated fragment. The DNA sequence can then be read from the gel as illustrated in the simulated autoradiograph below.

    As shown in the cartoon, the DNA sequence can be read by reading the bases from the bottom of the gel, starting with the C at the bottom of the C lane. Try reading the sequence yourself!

    The first semi-automated DNA sequencing method was invented in Leroy Hood's California lab in 1986. Though still Sanger sequencing, the four dideoxynucleotides in the sequencing reaction were tagged for detection with fluorescent dyes instead of radioactive phosphate-tagged nucleotides. After the sequencing reactions, the reaction products are electrophoresed on an 'automated DNA sequencer'. UV light excites the migrating dye-terminated DNA fragments as they pass through a detector. The color of their fluorescence is detected, processed and sent to a computer, generating a color-coded graph like the one below, showing the order (and therefore length) of fragments passing the detector and thus, the sequence of the strand.

    A most useful feature of this sequencing method is that a template DNA could be sequenced in a single tube, containing all the required components, including all four dideoxynucleotides! That's because the fluorescence detector in the sequencing machine separately sees all the short ddNTP-terminated fragments as they move through the electrophoretic gel.

    Hood's innovations were quickly commercialized, making major sequencing projects possible, including whole genome sequencing. The rapidity of automated DNA sequencing led to the creation of large sequence databases in the U.S. and Europe.

    The NCBI (National Center for Biotechnology Information) maintains the U.S. database. Despite its location, the NCBI archives virtually all DNA sequences determined worldwide. New 'tiny' DNA sequencers have made sequencing DNA so portable that in 2016, one was even used in the International Space Station. Sequence databases have expanded rapidly, as have the tools and protocols (some are described below) to find, compare and analyze DNA sequences.

    Biologists unravel full sequence of DNA repair mechanism

    Every living organism has DNA, and every living organism engages in DNA replication, the process by which DNA makes an exact copy of itself during cell division. While it's a tried-and-true process, problems can arise.

    Break-induced replication (BIR) is a way to solve those problems. In humans, it is employed chiefly to repair breaks in DNA that cannot be fixed otherwise. Yet BIR itself, through its repairs to DNA and how it conducts those repairs, can introduce or cause genomic rearrangements and mutations contributing to cancer development.

    "It's kind of a double-edged sword," says Anna Malkova, professor in the Department of Biology at the University of Iowa, who has studied BIR since 1995. "The basic ability to repair is a good thing, and some DNA breaks can't be repaired by other methods. So, the idea is very good. But the outcomes can be bad."

    A new study led by Malkova, published Jan. 20 in the journal Nature, seeks to tease out BIR's high risk-reward arrangement by describing for the first time the beginning-to-end sequence in BIR. The biologists developed a new technique that enabled them to study in a yeast model how BIR operates throughout its repair cycle. Until now, scientists had only been able to study BIR's operations at the beginning and end stages. The researchers then introduced obstructions with DNA replication, such as transcription—the process of copying DNA to produce proteins—that are believed to be aided by BIR.

    "Our study shows that when BIR comes to the rescue at these collisions, its arrival comes at a very high price," says Malkova, the study's corresponding author. "When BIR meets transcription, it can introduce even more instability, which can lead to even higher mutations. As a result, we think that instabilities that mainly were found at collisions between transcription and replication that have been suggested to lead to cancer might be caused by BIR that came to the rescue. It comes, it rescues, but it's kind of questionable how helpful it really is."

    Scientists have known how BIR works at some stages. For example, they know the DNA repair apparatus forms a bubble of sorts around the damaged DNA, then moves forward, unzipping the DNA, copying intact segments, and finally transferring those copied segments to a new DNA strand.

    But what remained elusive was following BIR throughout its entire repair cycle. Using a technique involving Droplet Digital PCR and a new DNA purification method developed by biology graduate student Liping Liu, the researchers were able to observe BIR from beginning to end.

    "If you imagine this as a train, Liping installed a bunch of stations, and she watched how the train proceeded at each station, tracking the increase in DNA at each station, how much increase is occurring at each station, and thus, in aggregate, how the entire process unfolds," Malkova explains.

    The team then intentionally introduced obstructions at some stations—transcription and another obstruction called internal telomere sequences—to observe how BIR responded to the obstacles. One finding: when transcription is introduced near the beginning of the BIR process, the repairs fail to commence, as if they're being suppressed. Also, the researchers found the orientation of the transcription with respect to BIR can affect the repair cycle and may be an important factor affecting instability that can promote cancer in humans.

    "Scientists already know there's a lot of instability in places where high transcription meets normal replication," Malkova says. "What we did not know until now is where is it coming from and why is it happening."

    The first author of the study, "Tracking break-induced replication shows that it stalls at roadblocks," is Liu, who is a sixth-year graduate student in Malkova's lab.

