What exactly are computers used for in DNA sequencing?


I've thoroughly read the Wikipedia article on DNA sequencing and can't get one thing.

There's some hardcore chemistry involved in the process that somehow splits the DNA and then isolates its parts.

Yet DNA sequencing is considered to be a very computationally-intensive process. I don't get what exactly is being computed there - what data comes into computers and what computers compute specifically.

What exactly is being computed there? Where do I get more information on this?


Computers are used in several steps of sequencing, from the raw data to finished sequence:

Image processing

Modern sequencers usually use fluorescent labelling of DNA fragments in solution. The fluorescence encodes the different nucleobase (= “base”) types (generally called A, C, G and T). To achieve high throughput, millions of sequencing reactions are performed in parallel in microscopic quantities on a glass chip, and for each micro-reaction, the label needs to be recorded at each step in the reaction.

This means: the sequencer takes progressive digital photographs of the chip containing the sequencing reagent. These photos have differently coloured pixels which need to be told apart and assigned a specific colour value.

As can be seen, this (strongly magnified; the image is < 100 µm across!) image fragment is fuzzy and many of the dots overlap. This makes it hard to determine which colour to assign to which pixel (though more recent versions of the sequencing machine have improved focussing systems, and the image is consequently crisper).

Base calling

One such image is registered for each step of the sequencing process, yielding one image for each base of the fragments. For a fragment of length 75, that'd be 75 images.

Once you have analysed the images, you get colour spectra for each pixel across the images. The spectra for each pixel correspond to one sequence fragment (often called a “read”) and are considered separately. So for each fragment you get such a spectrum:

(This image is generated by an alternative sequencing process called Sanger sequencing but the principle is the same.)

Now you need to decide which base to assign for each position based on the signal (“base calling”). For most positions this is fairly easy but sometimes the signal overlaps or decays significantly. This has to be considered when deciding the base calling quality (i.e. which confidence you assign to your decision for a given base).
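To make the base calling step concrete, here is a minimal sketch. It assumes each cycle has already been reduced to four channel intensities in A, C, G, T order, and it uses a deliberately naive error estimate (the share of signal outside the brightest channel). The function names and the error model are illustrative assumptions, not what any real sequencer does; only the Phred formula Q = -10 * log10(p_error) is standard.

```python
import math

BASES = "ACGT"

def call_base(intensities):
    """Pick the brightest channel and attach a Phred-style quality.

    intensities: four non-negative numbers in A, C, G, T order.
    The error estimate below is an illustrative assumption.
    """
    total = sum(intensities)
    if total <= 0:
        return "N", 0  # no signal at all: unknown base, zero confidence
    best = max(range(4), key=lambda i: intensities[i])
    p_error = 1.0 - intensities[best] / total
    p_error = max(p_error, 1e-4)  # cap the quality at Q40
    quality = round(-10 * math.log10(p_error))
    return BASES[best], quality

def call_read(cycles):
    """Call every cycle of one read; returns (sequence, list of qualities)."""
    calls = [call_base(c) for c in cycles]
    sequence = "".join(base for base, _ in calls)
    qualities = [q for _, q in calls]
    return sequence, qualities
```

A clean signal such as `[990, 5, 3, 2]` gives a confident `A`, while an evenly split signal such as `[50, 50, 0, 0]` still yields a call but with a very low quality, which is exactly the information later steps use to distrust that position.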

Doing this for each read yields up to billions of reads, each representing a short fragment of the original DNA that you sequenced.

Most bioinformatics analysis starts here; that is, the machines emit files containing the short sequence fragments. Now we need to make a sequence from them.

Read mapping and assembly

The key point that allows retrieving the original sequence from these small fragments is the fact that these fragments are (non-uniformly) randomly distributed over the genome, and they are overlapping.

The next step depends on whether you have a similar, already sequenced genome at hand. Often, this is the case. For instance, there is a high-quality “reference sequence” of the human genome and since all the genomic sequences of all humans are ~99.9% identical (depending on how you count), you can simply look where your reads align to the reference.

Read mapping

This is done to search for single changes between the reference and your currently studied genome, for example to detect mutations that lead to diseases.

So all you have to do is to map the reads back to their original location in the reference genome (in blue) and look for differences (such as base pair differences, insertions, deletions, inversions… ).

Two points make this hard:

  1. You have got billions (!) of reads, and the reference genome is often several gigabytes large. Even with the fastest conceivable implementation of naive string search, this would take prohibitively long.

  2. The strings don't match precisely. First of all, there are of course differences between the genomes - otherwise, you wouldn't sequence the data at all, you'd already have it! Most of these differences are single base pair differences - SNPs (= single nucleotide polymorphisms) - but there are also larger variations that are much harder to deal with (and they are often ignored in this step).

    Furthermore, the sequencing machines aren't perfect. A lot of things influence the quality, first and foremost the quality of the sample preparation, and minute differences in the chemistry. All this leads to errors in the reads.

In summary, you need to find the position of billions of small strings in a larger string which is several gigabytes in size. All this data doesn't even fit into a normal computer's memory. And you need to account for mismatches between the reads and the genome.
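One common way around both problems is to index the reference once and then only verify reads at positions suggested by short exact matches ("seed and extend"). The sketch below is a toy version of that idea using a plain dictionary of k-mers; real mappers use far more compact indexes and also handle insertions and deletions. The names `build_index` and `map_read` and all parameters are my own, chosen for illustration.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatches) of the best gap-free placement, or None.

    Every k-mer of the read is used as a seed; each candidate placement is
    then checked base by base, tolerating a few substitutions.
    """
    best = None
    seen = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference) or start in seen:
                continue
            seen.add(start)
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best
```

The index is built once and reused for every read, so the cost per read depends on the number of seed hits rather than on the length of the genome. The mismatch budget is what lets a read with a SNP or a sequencing error still find its place.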

Unfortunately, this still doesn't yield the complete genome. The main reason is that some regions of the genome are highly repetitive and badly conserved, so that it's impossible to map reads uniquely to such regions.

As a consequence, you instead end up with distinct, contiguous blocks (“contigs”) of mapped reads. Each contig is a sequence fragment, like a read, but much larger (and hopefully with fewer errors).

Assembly

Sometimes you want to sequence a new organism, so you don't have a reference sequence to map to. Instead, you need to do a de novo assembly. An assembly can also be used to piece together contigs from mapped reads (but different algorithms are used).

Again we use the property of the reads that they overlap. If you find two fragments which look like this:

ACGTCGATCGCTAGCCGCATCAGCAAACAACACGCTACAGCCT
ATCCCCAAACAACACGCTACAGCCTGGCGGGGCATAGCACTGG

You can be quite certain that they overlap like this in the genome:

ACGTCGATCGCTAGCCGCATCAGCAAACAACACGCTACAGCCT
                  ATCCCCATTCAACACGCTA-AGCTTGGCGGGGCATACGCACTG

(Notice again that this isn't a perfect match.)

So now, instead of searching for all the reads in a reference sequence, you search for head-to-tail correspondences between reads in your collection of billions of reads.

If you compare the mapping of a read to searching a needle in a haystack (an often used analogy), then assembling reads is akin to comparing all the straws in the haystack to each other straw, and putting them in order of similarity.
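The head-to-tail test itself is simple; it is the number of pairs that hurts. Below is a minimal sketch of the test for a single pair, assuming substitutions only (no gaps). The function name, the minimum overlap length and the mismatch budget are illustrative choices, not values taken from any particular assembler.

```python
def best_overlap(left, right, min_len=10, max_mismatches=3):
    """Longest suffix of `left` that matches a prefix of `right`.

    Returns (overlap_length, mismatches), or None if no overlap of at
    least `min_len` bases stays within the mismatch budget.
    Only substitutions are tolerated; gaps are not modelled.
    """
    for length in range(min(len(left), len(right)), min_len - 1, -1):
        mismatches = sum(a != b for a, b in zip(left[-length:], right[:length]))
        if mismatches <= max_mismatches:
            return length, mismatches
    return None
```

For the first pair of fragments shown above, this reports a 25-base overlap with 2 mismatches. Running the same test on every pair among billions of reads is quadratic in the number of reads, which is why real assemblers avoid all-against-all comparison with indexes or graph structures.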


Think about it like this. Suppose you own a hundred copies of "The Lord of the Rings", a 500000 word novel. Unfortunately, you have those hundred copies in the form of several million tiny scraps of paper, each of which contains about ten sequential words from the novel. Your task is to take those several million scraps of paper and put them in order so that you can read the novel from start to finish. Suppose for example you find the fragment

stab that vile creature, when he had a chance!" "Pity?

You could then search the other several million fragments for a fragment that overlaps this in some way. Perhaps you find

chance!" "Pity? It was Pity that stayed his hand. Pity, and Mercy:

Odds are extremely good that those fragments go together into

stab that vile creature, when he had a chance!" "Pity? It was Pity that stayed his hand. Pity, and Mercy:

But maybe not! Maybe either (1) there is another fragment of the novel that has chance!" "Pity? and that is the correct overlap, or (oh, by the way, did I mention?) (2) some of those scraps of paper contain errors, and you have to also detect and eliminate them.

That is an extremely computationally intensive job. DNA assemblers have the same problem: millions upon millions of tiny scraps of DNA that overlap, that might contain errors, and that need to be sorted into order by analyzing their overlaps and gradually building up short fragments into longer and longer fragments.
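The "gradually building up" part can be sketched as a greedy merge: repeatedly join the two fragments with the longest overlap until nothing overlaps any more. This toy version assumes error-free fragments and exact overlaps, which is precisely what real data does not give you, and greedy merging can be fooled by repeated passages. The function names are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest exact suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments by always joining the pair with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (0, None, None)
        # Compare every fragment against every other one: the expensive part.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing overlaps: the remaining pieces stay separate
        merged = frags[i] + frags[j][n:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags
```

It works on scraps of text and on DNA alike, since both are just strings. Whatever remains unmerged at the end plays the role of separate contigs.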


In a genome, there are usually billions of base pairs. However, it's impossible to read all of them in one go. The DNA is fragmented, and the sequence of the fragments is determined. Next-generation sequencing techniques are faster and cheaper, but produce only short fragments (say, 100 base pairs, this depends on the technology). It's extremely computationally intensive to put these fragments back together.
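A bit of arithmetic shows why so many fragments are needed. The sketch below uses the standard definition of coverage (total sequenced bases divided by genome size) and the usual Poisson approximation for the fraction of bases no read touches. That approximation assumes reads land uniformly at random, which is only roughly true in practice, and the example figures (a 3-billion-base genome, 30-fold coverage) are assumptions for illustration.

```python
import math

def expected_coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return num_reads * read_length / genome_size

def reads_needed(genome_size, read_length, coverage):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(genome_size * coverage / read_length)

def fraction_uncovered(coverage):
    """Poisson approximation: share of bases hit by no read at all."""
    return math.exp(-coverage)
```

With 100-base reads, `reads_needed(3_000_000_000, 100, 30)` comes to 900 million reads, and every one of them has to be placed relative to the others.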

More info: Genome sequence assembly primer; intro in Nature Methods


As you mentioned in the question, current sequencing platforms split the genomic DNA into many small pieces which the machine then analyzes. The product of a sequencing experiment is millions or even billions of short "reads"---strings of A, C, G, and T representing the nucleotides of a single fragment of DNA.

The DNA reads in this form aren't particularly useful. The idea in the first place was to determine the sequence of the entire DNA molecule. This is where genome assembly software comes in---to determine the original sequence of the genomic DNA by finding the optimal arrangement of overlapping reads to reconstruct the original DNA sequence.

Computers are crucial at 2 stages of this process---first, in the sequencing experiment itself, the platform must record and interpret fluorescent signals to generate the sequence reads in the first place; and second, very powerful computers are needed to assemble the reads back into a contiguous sequence to recover the original DNA sequence.


How do we Sequence DNA?

DNA: We assume you've read through the description of DNA structure (an earlier link in this thread), right? You have hopefully also read the link that describes DNA denaturation, annealing and replication, since the following page builds on those basics.

Plasmid: A 'plasmid' is a small, circular piece of DNA that is often found in bacteria. This innocuous molecule might help the bacteria survive in the presence of an antibiotic, for example, due to the genes it carries. To scientists, however, plasmids are important because (i) we can isolate them in large quantities, (ii) we can cut and splice them, adding whatever DNA we choose, (iii) we can put them back into bacteria, where they'll replicate along with the bacteria's own DNA, and (iv) we can isolate them again, getting billions of copies of whatever DNA we inserted into the plasmid! Plasmids are generally limited to sizes of 2.5-20 kilobases (kb).

BAC: The term 'BAC' is an acronym for 'Bacterial Artificial Chromosome', and in principle, it is used like a plasmid. We construct BACs that carry DNA from humans or mice or wherever, and we insert the BAC into a host bacterium. As with the plasmid, when we grow that bacterium, we replicate the BAC as well. Huge pieces of DNA can be easily replicated using BACs, usually on the order of 100-400 kilobases (kb). Using BACs, scientists have cloned (replicated) major chunks of human DNA. This, as you will see later, is critical to the Human Genome Project.

Vector: The 'vector' is generally the basic type of DNA molecule used to replicate your DNA, like a plasmid or a BAC.

Insert: The 'insert' is a piece of DNA we've purposely put into another (a 'vector') so that we can replicate it. Consequently, the 'insert' is usually the interesting part. In the case of the Human Genome Project or other sequencing projects, the insert is the part we want to sequence, the part we don't know. Usually we know the complete DNA sequence of the vector.

Shotgun Sequencing: Shotgun sequencing is a method for determining the sequence of a very large piece of DNA. The basic DNA sequencing reaction can only get the sequence of a few hundred nucleotides. For larger ones (like BAC DNA), we usually fragment the DNA and insert the resultant pieces into a convenient vector (a plasmid, usually) to replicate them. After we sequence the fragments, we try to deduce from them the sequence of the original BAC DNA.


Techniques in Sequencing

Automated DNA Sequencing

Automated sequencing has been developed to sequence very large amounts of DNA. This procedure uses the principle of the Sanger chain-termination method. Instead of labeling dATP, as in the original Sanger method, each of the dideoxynucleotides used in the reaction is labeled with a different fluorescent marker.

After the extension reactions and chain termination are completed, to work out the DNA sequence, the mixture is loaded into a well of a polyacrylamide slab gel, or into a tube of a capillary gel system, and electrophoresis is carried out to separate the molecules according to their lengths. After separation, the molecules are run past a fluorescent detector capable of discriminating the labels attached to the dideoxynucleotides ( Fig. 11.3 ). The detector therefore determines if each molecule ends in an A, C, G, or T. The sequence can be printed out for examination, or entered directly into a storage device for future analysis. The entire dideoxynucleotide sequencing process has been automated to increase the rate of acquisition of DNA sequence data. This is essential for large-scale sequencing projects such as those involving whole prokaryotic or eukaryotic genomes.
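The computational core of this readout is small: order the terminated fragments by length and read off the label at the end of each one. A minimal sketch, assuming the detector output has already been turned into (length, terminal base) pairs; that input format and the function name are assumptions made for illustration.

```python
from collections import Counter

def read_sanger(fragments):
    """Reconstruct the synthesized strand from chain-terminated fragments.

    fragments: iterable of (length, terminal_base) pairs, in any order.
    Fragments of the same length are resolved by majority vote; lengths
    that were never observed are reported as 'N'.
    """
    by_length = {}
    for length, base in fragments:
        by_length.setdefault(length, Counter())[base] += 1
    if not by_length:
        return ""
    sequence = []
    for length in range(min(by_length), max(by_length) + 1):
        if length in by_length:
            sequence.append(by_length[length].most_common(1)[0][0])
        else:
            sequence.append("N")
    return "".join(sequence)
```

Electrophoresis does the sorting physically, so the detector sees the shortest fragments first; the code only mirrors that ordering.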

Fig. 11.3. The process of automated sequencing. The procedure is very similar to Sanger's chain termination. (A) Each of the dideoxynucleotides used in the reaction is labeled with a different fluorescent marker. The chain-termination sequencing is performed in one tube, and this tube contains four fluorescently labeled dideoxynucleotides. The numbers in front of each nucleotide chain represent the length of that corresponding nucleotide chain in each reaction. (B) After the extension reactions, the mixture is loaded into a well of a polyacrylamide slab gel, or into a tube of a capillary gel system, and electrophoresis is carried out to separate the molecules according to their lengths. After separation, the molecules are run past a fluorescent detector capable of discriminating between the labels attached to the dideoxynucleotides.


Principles of Identification of DNA Sequence: 3 Principles

This article throws light upon the three principles of identification of DNA sequence. The three principles are: (1) Nucleic Acid Hybridization (2) DNA Probes and (3) DNA Chip-Microarray of Gene Probes.

Principle # 1. Nucleic Acid Hybridization:

Hybridization of nucleic acids (particularly DNA) is the basis for reliable DNA analysis. Hybridization is based on the principle that a single-stranded DNA molecule recognizes and specifically binds to a complementary DNA strand amid a mixture of other DNA strands. This is comparable to a specific key and lock relationship. The general procedure adopted for nucleic acid hybridization is as follows (Fig. 14.1).

The single-stranded target DNA is bound to a membrane support. Now the DNA probe (single-stranded and labeled with a detector substance) is added. Under appropriate conditions (temperature, ionic strength), the DNA probe pairs with the complementary target DNA.

The unbound DNA probe is removed. The sequence of nucleotides in the target DNA can be identified from the known sequence of the DNA probe. There are two types of DNA hybridization, radioactive and non-radioactive, which use DNA probes labeled with isotopes and non-isotopes, respectively, as detectors.
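In sequence terms, "the probe pairs with the complementary target" means that the reverse complement of the probe occurs somewhere in the target strand. A minimal sketch of that lookup, assuming perfect base pairing and DNA written with the letters A, C, G, T only (real hybridization tolerates some mismatches depending on the conditions); the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the strand that pairs with `seq`, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def binding_sites(target, probe):
    """0-based start positions on the single-stranded target where the
    probe can pair perfectly along its whole length."""
    site = reverse_complement(probe)
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        start = target.find(site, start + 1)
    return positions
```

Because the probe sequence is known, every reported position tells you the target sequence at that spot: it is the probe's reverse complement.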

Principle # 2. DNA Probes:

A DNA probe or a gene probe is a synthetic, single-stranded DNA molecule that can recognize and specifically bind to a target DNA (by complementary base pairing) in a mixture of biomolecules. DNA probes are either long (> 100 nucleotides) or short (< 50 nucleotides), and may bind to the total or a small portion of the target DNA. There is a wide variation in the size of DNA probes used (may range from 10 bases to 10,000 bases). The most important requirement is their specific and stable binding with target DNAs.

Methods Employed to Obtain DNA Probes:

A great majority of DNA probes are chemically synthesized in the laboratory. There are, however, many other ways of obtaining them: isolation of selected regions of genes, cloning of intact genes, and synthesis from mRNAs.

Isolation of selected regions of genes:

The DNA from an organism (say a pathogen) can be cut by using restriction endonucleases. These DNA fragments are cloned in vectors and the DNA probes can be selected by screening.

Synthesis of DNA probes from mRNA:

The mRNA molecules specific to a particular DNA sequence (encoding a protein) are isolated. By using the enzyme reverse transcriptase, complementary DNA (cDNA) molecules are synthesized. This cDNA can be used as a probe to detect the target DNA.

Mechanism of Action of DNA Probes:

The basic principle of DNA probes is based on the denaturation and renaturation (hybridization) of DNA. When a double-stranded DNA molecule is subjected to physical (temperature > 95°C or pH > 10.5) or chemical (addition of urea or formaldehyde) changes, the hydrogen bonds break and the complementary strands get separated.

This process is called denaturation. Under suitable conditions (i.e., temperature, pH, salt concentration), the two separated single DNA strands can reassemble to form the original double-stranded DNA, and this phenomenon is referred to as renaturation or hybridization.

Radioactive Detection System:

The DNA probe is usually tagged with a radioactive isotope (commonly phosphorus-32). The target DNA is purified and denatured, and mixed with the DNA probe. The isotope-labeled DNA molecules specifically hybridize with the target DNA (Fig. 14.1).

The non-hybridized probe DNA is washed away. The presence of radioactivity in the hybridized DNA can be detected by autoradiography. This reveals the presence of any bound (hybridized) probe molecules and thus the complementary DNA sequences in the target DNA.

Non-radioactive Detection System:

The disadvantage of radioactive labels is that the isotopes have short half-lives and involve risks in handling, besides requiring special laboratory equipment. So, non-radioactive detection systems (e.g., biotinylation) have also been developed. Biotin-labeled (biotinylated) nucleotides are incorporated into the DNA probe. The detection system is based on the enzymatic conversion of a chromogenic (colour producing) or chemiluminescent (light emitting) substrate.

The procedure commonly adopted for chemiluminescent detection of target DNA is depicted in Fig. 14.2. A biotin labeled DNA probe is hybridized to the target DNA. The egg white protein avidin or its bacterial analog streptavidin is added to bind to biotin. Now a biotin labeled enzyme, such as alkaline phosphatase is added which attaches to avidin or streptavidin. These proteins have four separate biotin-binding sites.

Thus, a single molecule (avidin or streptavidin) can bind to biotin-labeled DNA probe as well as biotin-labeled enzyme. On the addition of a chemiluminescent substrate, the enzyme alkaline phosphatase acts and converts it to a light emitting product which can be measured.

The biotin-labeled DNA is quite stable at room temperature for about one year. The detection devices using chemiluminescence are preferred, since they are as sensitive as radioisotope detection, and more sensitive than the use of chromogenic detection systems.

PCR in the use of DNA probes:

DNA probes can be successfully used for the identification of target DNAs from various samples (blood, urine, feces, tissues, throat washings) without much purification. The detection of a target sequence becomes quite difficult if the quantity of DNA is very low. In such a case, the polymerase chain reaction (PCR) is first employed to amplify the minute quantities of target DNA, which is then identified by a DNA probe.

DNA Probes and Signal Amplification:

Signal amplification is an alternative to PCR for the identification of minute quantities of DNA by using DNA probes. In case of PCR, target DNA is amplified, while in signal amplification it is the target DNA bound to DNA probe that is amplified.

There are two general methods to achieve signal amplification.

1. Separate the DNA target—DNA probe complex from the rest of the DNA molecules, and then amplify it.

2. Amplify the DNA probe (bound to target DNA) by using a second probe. The RNA complementary to the DNA probe can serve as the second probe. The RNA-DNA-DNA complex can be separated and amplified. The enzyme Q-beta replicase, which catalyses RNA replication, is commonly used.

Principle # 3. The DNA Chip-Microarray of Gene Probes:

The DNA chip or Gene-chip contains thousands of DNA probes (400,000 or even more) arranged on a small glass slide the size of a postage stamp. With this recent and advanced approach, thousands of target DNA molecules can be scanned simultaneously.

Technique for Use of DNA Chip:

The unknown DNA molecules are cut into fragments by restriction endonucleases. Fluorescent markers are attached to these DNA fragments. They are allowed to react to the probes of the DNA chip. Target DNA fragments with complementary sequences bind to DNA probes. The remaining DNA fragments are washed away. The target DNA pieces can be identified by their fluorescence emission by passing a laser beam. A computer is used to record the pattern of fluorescence emission and DNA identification.

The technique of employing DNA chips is very rapid, besides being sensitive and specific for the identification of several DNA fragments simultaneously. Scientists are trying to develop Gene-chips for the entire genome of an organism.

Applications of DNA Chip:

The presence of mutations in a DNA sequence can be conveniently identified. In fact, Gene-chip probe arrays have been successfully used for the detection of mutations in the p53 and BRCA1 genes. Both these genes are involved in cancer.


Did you know that your genome contains about six billion individual building blocks - and that we can now read the order of all those building blocks in about a day and for about $1000? Leaps in technology since the Human Genome Project have enabled remarkable genomics-based advances in medicine, agriculture, forensics, and our understanding of evolution.

Our genome (that is, our DNA "blueprint") - and in fact the genomes of all life forms on earth - are made of four chemical "bases" strung together in varying orders. To study the exact order (or sequence) of someone's DNA, researchers follow three major steps: (1) purify and copy the DNA (2) read the sequence and (3) compare to other sequences.

First they use chemical methods to purify, then, for some methods, "amplify" the DNA in the sample - that means they copy small parts of the sample to reach high enough levels for measuring. The amplification step makes it possible to do DNA testing from very small starting amounts, like those in forensic samples or ancient bones. Then, different methods can be used to determine the order of each base in the DNA sample. Finally, they use computers to compare the sequence of the DNA to a reference sequence (for example, of the human genome), in order to see if there are any differences in the order of the bases.
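The final comparison step can be sketched as a "pileup": stack all reads that cover a position and let them vote. The version below assumes the reads have already been placed on the reference without gaps, and the thresholds (`min_depth`, `min_fraction`) are arbitrary illustrative values, not recommendations. Real variant callers use statistical models and base qualities instead of a simple vote.

```python
from collections import Counter

def call_variants(reference, mapped_reads, min_depth=3, min_fraction=0.8):
    """Report positions where the reads consistently disagree with the reference.

    mapped_reads: iterable of (start, sequence) pairs, aligned without gaps.
    Returns a list of (position, reference_base, observed_base).
    """
    piles = [Counter() for _ in reference]
    for start, seq in mapped_reads:
        for i, base in enumerate(seq):
            if 0 <= start + i < len(reference):
                piles[start + i][base] += 1
    variants = []
    for pos, pile in enumerate(piles):
        depth = sum(pile.values())
        if depth < min_depth:
            continue  # too few reads here to say anything
        base, count = pile.most_common(1)[0]
        if base != reference[pos] and count / depth >= min_fraction:
            variants.append((pos, reference[pos], base))
    return variants
```

Requiring several agreeing reads is what separates a real difference in the genome from an isolated sequencing error, which shows up in only one read at that position.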



Conclusion

Fundamentally, bioinformaticians use powerful computers to process big biological data. They do this by building and/or using algorithms and databases.

It helps to know the four omics that generate the datasets in bioinformatics. These are genomics, transcriptomics, proteomics, and metabolomics. Genomics and transcriptomics generate datasets for bioinformatics via high-throughput next-generation sequencing technologies. Proteomics and metabolomics generate datasets from mass spectrometry technologies.

Finally, bioinformatics can be considered a data science with biology domain knowledge. With such a broad field as bioinformatics, there is always the chance of missing something. Here I try to give a broad overview of this field, covering the major subfields within it.


What is Bioinformatics?

Bioinformatics is a relatively new field and as such, many people aren’t exactly sure what “bioinformatics” really is.

“Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.”

Still confused? Don’t fret, most people are when they hear that definition. I usually like to tell people:

“Bioinformatics combines the latest technology with biological research.”

Over the past decade or so, and even prior, computers have become an integral part of every industry. Biological research is no different. Computer technology has dramatically accelerated the rate at which scientists are able to acquire and analyze biological data. The vast amount of data that is produced more rapidly each day has introduced new challenges to the field, involving storing, organizing and archiving this data. The sharp increase in volume of data has also brought about the need for faster and better analysis and visualization tools. Each area of bioinformatics, from acquiring to storing to analyzing the data, has challenges of its own, and it is not uncommon for advancements in one area to drive advancements in another.

To gain a better understanding of the diversity of bioinformatics, let’s invent a hypothetical yet interesting problem that we want to tackle using bioinformatics:

Let’s assume we have a species of bacteria that is part of the normal millions of ‘good’ bacteria living on and inside healthy human beings; we’ll call it Bacteria X.0. One day Bacteria X started making people very ill. What happened to Bacteria X.0 to make it become the harmful Bacteria X.1? Let’s see how we could answer this question using bioinformatics, along the way gaining insight into the wonderful world of bioinformatics.

Using traditional molecular biology techniques, we isolate Bacteria X and extract its DNA. Then we “sequence” this DNA. Cue the first link in the bioinformatics chain: acquiring data! Acquiring data is the process of generating useable data from a biological sample. In our case, deriving and determining the DNA sequence of the Bacteria X genome.

The next link in the chain is storing this sequence data. While bacterial genomes are typically small, other genomes, such as those of human beings, can produce terabytes (1000 gigabytes) of data.

Now we analyze this sequence data. There are people who specialize in developing computational tools to analyze and visualize data, and people who actually analyze the information. A typical analysis for our sample case might be to first graphically visualize and compare the genome of the original, harmless Bacteria X.0 with the genome of the new, harmful Bacteria X.1. A scientist might observe a segment of DNA in Bacteria X.1 which is not present in the original Bacteria X.0. This new region of DNA may be responsible for the harmful effects, so the next analysis steps might be to drill down deeper into this region and see what genes lie there, what the functions of those genes are, where they may have come from, etc.
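As a rough illustration of how such a new segment could be spotted computationally, the sketch below flags stretches of the new genome made of short words (k-mers) that never occur in the old genome. This is a simplified stand-in for a real genome comparison; the function name and the choice of k are assumptions. Note that the reported intervals extend up to k-1 bases into the flanking sequence, because k-mers straddling the boundary of the new segment are novel too.

```python
def novel_regions(old_genome, new_genome, k=8):
    """Half-open (start, end) intervals of new_genome that are covered by
    k-mers which do not occur anywhere in old_genome."""
    old_kmers = {old_genome[i:i + k] for i in range(len(old_genome) - k + 1)}
    regions = []
    for i in range(len(new_genome) - k + 1):
        if new_genome[i:i + k] not in old_kmers:
            if regions and i <= regions[-1][1]:
                regions[-1][1] = i + k  # extend the current interval
            else:
                regions.append([i, i + k])  # start a new interval
    return [tuple(r) for r in regions]
```

In the hypothetical Bacteria X example, the intervals returned for the X.1 genome would be the candidate regions to inspect for genes.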

[Remember: all assumptions made and conclusions drawn in this example are hypothetical and for illustrative purposes only.]

In this example, we encountered at least 4 different specialized areas within the field of bioinformatics:

1) Acquiring of data (working with machines and equipment, sequencing DNA)
2) Storing data (typically working with databases)
3) Developing tools to analyze and visualize data (programming)
4) Analyzing data (statistics, analysis)

Typically, individuals will specialize in one particular area rather than working simultaneously across all these fields. That, combined with all the different applications of bioinformatics, means you could ask 100 different “bioinformaticians” what they do and get 100 very different answers!

Bioinformatics techniques are now employed in every area of biology and research, some of which include cancer research, crop yield optimization studies, medical genomics, ecology and evolution. The emerging field of DNA barcoding combines laboratory and bioinformatics techniques to catalogue all living species as well as identify new species. Since DNA is the blueprint of life, bioinformatics can be applied to any research involving living organisms (or organisms which once lived, see Otzi the Iceman).

One thing to remember: the four areas described above are not as simple as I’ve portrayed them to be. For example:

  • When sequencing a sample, you might be interested in sequencing RNA as opposed to DNA.
  • Before analyzing sequence data, the quality of this data must be validated. Sometimes large chunks of sequences need to be ‘put together’ (e.g., ‘genome assembly’). Both these areas (quality analysis and genome assembly) are highly sought after areas of specialization.
  • In addition to sequencing, data analysis can also generate vast amounts of new data.

The field of bioinformatics is ever changing and rapidly evolving. Techniques that were new 2–3 years ago might be outdated today; conversely, techniques that were impractical 2–3 years ago might be invaluable today, thanks, for instance, to advances in computational processing capabilities.

So, whether you’re interested in plants, animals, bacteria, fungi, virology, genetics, developing databases, writing code, statistics, engineering, computer hardware, or web technologies, there may be a spot waiting for you in the field of bioinformatics.

