In BLASTP, the query sequence is broken into all possible 3-letter words using a moving window. Words with a score of 12 or more, i.e. words with more highly conserved amino acids, are collected into the initial BLASTP search set. Raw scores are then converted into bit scores by correcting for the scoring matrix used in the search and the size of the database search space.
Overview of the BLASTP process.
The query sequence EAGLES is broken into three-letter words or synonyms that are used as a search set against
records in a protein or translated nucleotide database. See the text for additional details.
The output data from BLASTP includes a table with the bit scores for each alignment
as well as its E-value, or “expect score”. The E-value indicates the number of alignments with
that particular bit score that would be expected to occur solely by chance in the search space. Alignments with the highest bit scores (and lowest E-values) are listed at the top of the table. For perfect or nearly perfect matches, the E-value is reported as zero - there is essentially no possibility that the match occurs randomly. The E-value takes into account both the length of the match and the size of the database that was surveyed. The longer the alignment, and/or the larger the database search space, the less likely that a particular alignment occurs strictly by chance.
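The relationship between bit score, search space, and E-value described above can be sketched with the simplified Karlin-Altschul relation E = m * n * 2^(-S'), where m is the query length, n the database (search space) length, and S' the bit score. The function name and the numbers below are illustrative; real BLAST also applies edge-effect corrections.

```python
def expect_value(bit_score, query_len, search_space_len):
    """E-value from a bit score: E = m * n * 2**(-S').
    A simplified sketch of the statistics described above."""
    return query_len * search_space_len * 2.0 ** (-bit_score)
```

Note how the formula captures both points made above: doubling the database size doubles E, while each extra bit of score halves it.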
In some cases, the alignment may not extend along the entire length of the protein or there may be gaps between aligned regions of the sequences. “Max score” is the bit score for the aligned region with the highest score. “Total score” adds the bit scores for all aligned regions. When there are no gaps in an alignment, the total and max scores are the same. The “Query cover” refers to the fraction of the query sequence where the alignment score is above the threshold value. BLASTP also reports the percentage of aligned amino acids that are identical in two sequences as “Ident.”
Find protein homologs with BLASTp
I'm trying to find homologs of a set of proteins using BLASTp. I'm working with custom databases.
I'm using evalue of 0.00001 as threshold.
I would like to filter queries having hits with >90% identities. Because BLASTp output is based on HSPs, I cannot filter by %identities/query, only by HSP.
I would like to know how to do this and also if I'm following a reasonable strategy.
Here is an alignment example: qcovs=100 but qcovhsp lower.
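The distinction can be sketched directly: qcovs merges all HSP intervals on the query before measuring coverage, while qcovhsp measures each HSP alone. The coordinates below are invented for illustration and use 1-based inclusive positions as in BLAST tabular output.

```python
def coverage_stats(query_len, hsps):
    """Compute (qcovs, per-HSP qcovhsp) from HSP query coordinates.
    `hsps` is a list of (start, end) pairs, 1-based inclusive."""
    merged = []
    for start, end in sorted(hsps):
        # Merge intervals that touch or overlap before measuring coverage.
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    covered = sum(end - start + 1 for start, end in merged)
    qcovs = 100 * covered // query_len
    qcovhsp = [100 * (end - start + 1) // query_len for start, end in hsps]
    return qcovs, qcovhsp
```

For example, two adjacent HSPs covering residues 1-60 and 61-100 of a 100-residue query give qcovs = 100 even though each individual qcovhsp is lower, which is exactly the situation described above.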
No, the search won't be less reliable. It will only be more sensitive, more capable of finding matches. To understand this, you need to understand how the BLAST algorithm works, what the word size means.
When you use a word size of $N$, BLAST starts by looking for a match of length $N$ between your query and target sequences. If such a match is found, BLAST then attempts to extend the match as far as possible. So, for example, consider these two sequences:
As you can see, they are quite similar:
Now, what happens if you try to find the target sequence with BLAST using the query sequence and a word size of 2 ($N=2$)? BLAST will take the first $N$ characters of the query and try to match them against the target:
It will then try to extend:
It will find a gap and apply the gap penalty, but continue matching since the next characters match[1], thereby increasing the overall score of the alignment:
And so on and so forth. BLAST starts by matching the first $N$-sized word and then attempts to extend the alignment.
Now, what if we were using $N=3$?
Because the word size is now 3, we need at least three matching residues to match in order to attempt to extend the alignment. Since there is no 3-letter match, the target is missed.
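The seeding behaviour described above can be sketched in a few lines of Python. This is a toy model: it looks only for exact word matches, ignoring BLAST's scoring-matrix word neighbourhood, and the sequences below are made up.

```python
def seed_matches(query, target, n):
    """Return (query_pos, target_pos) pairs where an n-length word of the
    query occurs exactly in the target -- simplified BLAST alignment seeds."""
    hits = []
    for i in range(len(query) - n + 1):
        word = query[i:i + n]
        j = target.find(word)
        while j != -1:
            hits.append((i, j))
            j = target.find(word, j + 1)
    return hits
```

With toy sequences such as query "ABCDE" and target "ABXCDXE", a word size of 2 yields two seeds to extend, while a word size of 3 yields none, so the target would be missed entirely.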
So, decreasing the word size is better for finding small protein matches because it can use shorter matching sequences as an alignment seed. As a general rule, playing with the word size is a question of finding the right balance between specificity and sensitivity. Decreasing the word size helps you find more matches (increases sensitivity) but it will also find more divergent sequences (decrease specificity). Increasing the word size will reduce the number of matches (decrease sensitivity) because you need a longer perfect match to start extending the alignment, but will increase specificity for the same reason.
If you raise the word size to something close to the size of the query, you will only find almost exact matches. That might be what you want to do and it is what algorithms like BLAT which are designed to find near exact matches (for example, mapping a sequence to the genome it came from) do. Usually, however, we want less than perfect matches since we are looking for homologous sequences and not the exact same query. In such cases, we will decrease the word size for smaller proteins to increase our chances of finding homologs.
[1] This is a bit of a simplification. The rules for extending alignments are more complex, but it serves to illustrate the point.
Materials and Methods
All versions of the BLASTN algorithm (2.2.25+, 2.2.26+, 2.2.27+, 2.2.28+, 2.2.29+, 2.2.30+) were obtained from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/) and freshly installed locally on a Debian Linux server with dual Intel Xeon 10-core processors (E5-2690 v2) providing 40 logical cores and 384 GB RAM. The query and target genome data were accessed from 12 Gb/sec SAS-connected solid-state drives. Two such identical systems were utilized for the other studies (nucmer and LASTZ) described below. BLASTN commands/jobs were run within UNIX shell scripts, and execution time was measured in seconds by programmatically subtracting start times from end times using standard shell commands in the job scripts. Query sets of 10, 100, 1,000, 10,000, and 100,000 sequences were individually tested. The query sets used randomly obtained 300-base fragments (via the Python3 random library) that were void of non-DNA characters (e.g. Ns), derived from three different chimpanzee chromosomes (3, 4, and 5), which provided three experimental replications for the study. The target database was the respective corresponding human homolog chromosome (3, 4, or 5), formatted using the “makeblastdb” program. Output files were written in CSV format, with chromosome IDs and execution-time log data appended to the CSV files after each job. Individual CSV files were then concatenated and imported as a single dataframe into the R statistical package (v 3.2.1). The main categorical fixed effects (factors) were “BLAST version” and “data set size.” The continuous response variables analyzed were execution speed (sequences blasted per second), percentage of hits returned, and alignment identity. Main effects and interactions were analyzed using a standard general linear model and other statistical functions provided by the R base package. Plotting of data for the figures in this paper was performed in R using the ggplot2 library or the base R package.
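The query-set construction described above can be sketched as follows. This is a simplified stand-in, not the study's actual script: the function name and the rejection rule are illustrative, and a real chromosome sequence would be read from FASTA.

```python
import random

def sample_fragments(chrom_seq, n_frags, size=300, seed=0):
    """Randomly sample fixed-size fragments containing only A/C/G/T,
    rejecting any window with Ns or other non-DNA characters."""
    rng = random.Random(seed)
    frags = []
    while len(frags) < n_frags:
        start = rng.randrange(len(chrom_seq) - size + 1)
        frag = chrom_seq[start:start + size]
        if set(frag) <= set("ACGT"):  # reject fragments containing Ns etc.
            frags.append(frag)
    return frags
```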
BLASTN parameters utilized were as follows: evalue 10, word_size 11, max_target_seqs 1, dust no, soft_masking false, ungapped, and num_threads 6.
In preliminary studies, BLASTN version 2.2.25+ was tested with gapping allowed, but this resulted in a continuous loop effect that returned no hits. In later BLASTN versions, gapping returned hits, but the bug effect of omitting query sequences was similar to that seen with ungapped searches. Therefore, the ungapped parameter was employed experiment-wide to statistically compare all algorithm versions. The BLAST+ software architecture and applications have been described in an earlier publication (Camacho et al. 2009).
To validate the ability of the BLASTN v2.2.25+ algorithm for consistency in returning hits at lower levels of identity (<95%), an additional experiment was performed. Ten random 300 base sequences from human chr22 were selected and then randomly mutated at 2% levels (incrementally) down to 74% identity using a Perl script written by this author. Sequence identity of each mutated sequence compared to the original and the BLAST output was evaluated using a Perl script written by this author.
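The incremental mutation procedure can be sketched as follows. This is not the author's Perl script; it is an illustrative Python version in which the identity level is controlled by substituting a fixed number of randomly chosen positions.

```python
import random

def mutate_to_identity(seq, identity, seed=0):
    """Substitute bases at randomly chosen positions so the result shares
    approximately `identity` (a fraction, e.g. 0.9) with the original."""
    rng = random.Random(seed)
    n_mut = round(len(seq) * (1 - identity))
    positions = rng.sample(range(len(seq)), n_mut)
    out = list(seq)
    for p in positions:
        # Always replace with a *different* base so each chosen
        # position really is a mismatch.
        out[p] = rng.choice([b for b in "ACGT" if b != out[p]])
    return "".join(out)
```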
The 2013 study performed by this author was repeated for all chromosomes using version 2.2.25+ of the BLASTN algorithm with the parameters described above and a sequence slice of 300 bases.
The nucmer genome alignment algorithm is part of the MUMmer3 package, which was downloaded from http://mummer.sourceforge.net and previously described in an earlier publication (Kurtz et al. 2004). Whole chimpanzee chromosomes were aligned onto human without the need for sequence slicing due to the nature of the algorithm. The following command sequence, run in a shell script, was used for each chimpanzee chromosome: nucmer -maxmatch -c 100 -p <human_chrom.fasta> <chimp_chrom.fasta> (commands showing shell output, redirection, and log file creation not shown). Data from the resulting nucmer delta file was extracted using the MUMmer “show-coords” command and the output was reformatted into a csv file using a shell script written by this author. The data was then imported into the R statistical package for basic summary statistics and plotting.
The LASTZ algorithm was developed by the Miller lab at the Penn State University Center for Comparative Genomics and Bioinformatics (Harris 2007) and downloaded from the following URL: http://www.bx.psu.edu/miller_lab/dist/lastz-1.02.00.tar.gz. Due to the nature of the LASTZ algorithm when applied to whole chromosome alignments in preliminary studies done by this author, the chimpanzee chromosomes were sliced into 10,000 base fragments (retaining Ns) using a Python script written by this author. Typical default use of the LASTZ algorithm in comparing chimpanzee to human utilizes sequence masking to speed up the alignments and alleviate system resource restraints. Default parameters also omit non-similar sequence below a threshold of 95 to 98% identity. Therefore, to obtain a more realistic comparison of chimpanzee to human DNA, the following LASTZ command line was employed in shell script format: lastz <human_chrom_fasta_file>[unmask] <chimp_chrom_file> --step=10 --seed=match12 --notransition --exact=20 --noytrim --ambiguous=n --filter=coverage:50 --filter=identity:50 --format=general:start1,end1,length1,length2,strand2,identity > <out_file.dat>. The resulting data file from each alignment was reformatted into a csv file using a shell script written by this author and then imported into the R statistical package for basic summary statistics and plotting.
All UNIX shell, Python, R, and Perl scripts used in this study along with parameters for the various algorithms (contained inside shell scripts) were deposited at https://github.com/jt-icr/chimp_human_dna.
8.4 Pseudotime analysis
In many situations, one is studying a process where cells change continuously. This includes, for example, many differentiation processes taking place during development: following a stimulus, cells will change from one cell-type to another. Ideally, we would like to monitor the expression levels of an individual cell over time. Unfortunately, such monitoring is not possible with scRNA-seq since the cell is lysed (destroyed) when the RNA is extracted.
Instead, we must sample at multiple time-points and obtain snapshots of the gene expression profiles. Since some of the cells will proceed faster along the differentiation than others, each snapshot may contain cells at varying points along the developmental progression. We use statistical methods to order the cells along one or more paths representing the underlying developmental trajectories; this ordering is referred to as “pseudotime”.
In this chapter we will consider five different tools: Monocle, TSCAN, destiny, SLICER and ouija for ordering cells according to their pseudotime development. To illustrate the methods we will be using a dataset on mouse embryonic development (Deng et al. 2014) . The dataset consists of 268 cells from 10 different time-points of early mouse development. In this case, there is no need for pseudotime alignment since the cell labels provide information about the development trajectory. Thus, the labels allow us to establish a ground truth so that we can evaluate and compare the different methods.
A recent review by Cannoodt et al provides a detailed summary of the various computational methods for trajectory inference from single-cell transcriptomics (Cannoodt, Saelens, and Saeys 2016) . They discuss several tools, but unfortunately for our purposes many of these tools do not have complete or well-maintained implementations, and/or are not implemented in R.
- SCUBA - Matlab implementation
- Wanderlust - Matlab (and requires registration to even download)
- Wishbone - Python
- SLICER - R, but package only available on Github
- SCOUP - C++ command line tool
- Waterfall - R, but one R script in supplement
- Mpath - R package, but only available as a tar.gz on Github; has function documentation but no vignette/workflow
- Monocle - Bioconductor package
- TSCAN - Bioconductor package
Unfortunately only two tools discussed (Monocle and TSCAN) meet the gold standard of open-source software hosted in a reputable repository.
The following figures from the paper summarise some of the features of the various tools.
Figure 8.9: Descriptions of trajectory inference methods for single-cell transcriptomics data (Fig. 2 from Cannoodt et al, 2016).
Figure 8.10: Characterization of trajectory inference methods for single-cell transcriptomics data (Fig. 3 from Cannoodt et al, 2016).
8.4.1 First look at Deng data
Let us take a first look at the Deng data, without yet applying sophisticated pseudotime methods. As the plot below shows, simple PCA does a very good job of displaying the structure in these data. It is only once we reach the blast cell types (“earlyblast”, “midblast”, “lateblast”) that PCA struggles to separate the distinct cell types.
PCA, here, provides a useful baseline for assessing different pseudotime methods. For a very naive pseudotime we can just take the co-ordinates of the first principal component.
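The naive PC1 pseudotime can be sketched in plain Python (the chapter does this in R on the Deng data; here `expr` is any cells-by-genes matrix, and PC1 is found by power iteration for the sake of a self-contained example):

```python
def pc1_pseudotime(expr):
    """Naive pseudotime: each cell's coordinate on the first principal
    component. `expr` is a list of cells, each a list of expression values."""
    n, g = len(expr), len(expr[0])
    means = [sum(cell[j] for cell in expr) / n for j in range(g)]
    x = [[cell[j] - means[j] for j in range(g)] for cell in expr]
    v = [1.0] * g
    for _ in range(200):
        # One power-iteration step on X^T X: w = X^T (X v), then normalise.
        xv = [sum(x[i][k] * v[k] for k in range(g)) for i in range(n)]
        w = [sum(x[i][j] * xv[i] for i in range(n)) for j in range(g)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    # Project each centred cell onto the PC1 loadings.
    return [sum(x[i][j] * v[j] for j in range(g)) for i in range(n)]
```

Note that, as with any PCA-based pseudotime, the sign of PC1 is arbitrary, so the ordering may come out reversed.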
As the plot above shows, PC1 struggles to correctly order cells early and late in the developmental timecourse, but overall does a relatively good job of ordering cells by developmental time.
Can bespoke pseudotime methods do better than naive application of PCA?
TSCAN combines clustering with pseudotime analysis. First it clusters the cells using mclust, which is based on a mixture of normal distributions. Then it builds a minimum spanning tree to connect the clusters. The branch of this tree that connects the largest number of clusters is the main branch, which is used to determine pseudotime.
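The tree-building step can be illustrated with Prim's algorithm over cluster centres. This is a sketch of the idea, not TSCAN's implementation, and the centres below are invented.

```python
def mst_edges(centres):
    """Minimum spanning tree over cluster centres (Euclidean distance),
    grown greedily from centre 0 via Prim's algorithm."""
    n = len(centres)
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # Add the cheapest edge connecting the tree to an outside centre.
        i, j = min(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: dist(centres[e[0]], centres[e[1]]))
        in_tree.add(j)
        edges.append((i, j))
    return edges
```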
First we will try to use all genes to order the cells.
Frustratingly, TSCAN only provides pseudotime values for 221 of 268 cells, silently returning missing values for non-assigned cells.
Again, we examine which timepoints have been assigned to each state:
TSCAN gets the development trajectory the “wrong way around”, in the sense that later pseudotime values correspond to early timepoints and vice versa. This is not inherently a problem (it is easy enough to reverse the ordering to get the intuitive interpretation of pseudotime), but overall it would be a stretch to suggest that TSCAN performs better than PCA on this dataset. (As it is a PCA-based method, perhaps this is not entirely surprising.)
Exercise 1 Compare results for different numbers of clusters (clusternum).
Monocle skips the clustering stage of TSCAN and directly builds a minimum spanning tree on a reduced dimension representation of the cells to connect all cells. Monocle then identifies the longest path in this tree to determine pseudotime. If the data contains diverging trajectories (i.e. one cell type differentiates into two different cell-types), monocle can identify these. Each of the resulting forked paths is defined as a separate cell state.
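Extracting the longest path from such a tree can be done with two breadth-first searches (a standard trick for unweighted trees; this is a sketch of the idea, not Monocle's implementation):

```python
from collections import deque

def longest_path(adj, start=0):
    """Longest path in an unweighted tree: BFS from any node to find one
    endpoint, BFS again from that endpoint to find the other, then walk
    the parent pointers back. `adj` maps node -> list of neighbours."""
    def farthest(src):
        parents = {src: None}
        queue, last = deque([src]), src
        while queue:
            last = queue.popleft()
            for v in adj[last]:
                if v not in parents:
                    parents[v] = last
                    queue.append(v)
        return last, parents          # last node dequeued is a farthest node
    a, _ = farthest(start)
    b, parents = farthest(a)
    path = [b]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path[::-1]
```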
Unfortunately, Monocle does not work when all the genes are used, so we must carry out feature selection. First, we use M3Drop:
We can again compare the inferred pseudotime to the known sampling timepoints.
Monocle - at least with its default settings - performs poorly on these data. The “late2cell” group is completely separated from the “zy”, “early2cell” and “mid2cell” cells (though these are correctly ordered), and there is no separation at all of “4cell”, “8cell”, “16cell” or any blast cell groups.
8.4.4 Diffusion maps
Diffusion maps were introduced by Ronald Coifman and Stephane Lafon, and the underlying idea is to assume that the data are samples from a diffusion process. The method infers the low-dimensional manifold by estimating the eigenvalues and eigenvectors for the diffusion operator related to the data.
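The diffusion operator itself is easy to sketch: a Gaussian kernel between cells, row-normalised into a Markov transition matrix, whose leading non-trivial eigenvectors give the diffusion-map coordinates. This is a minimal illustration of the idea, not the destiny implementation (which uses an adaptive kernel and density normalisation).

```python
import math

def diffusion_operator(points, sigma=1.0):
    """Row-normalised Gaussian-kernel matrix over the given points; its
    eigenvectors define the diffusion-map embedding."""
    k = [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q))
                   / (2 * sigma ** 2))
          for q in points] for p in points]
    # Normalise each row so the matrix is a Markov transition operator.
    return [[v / sum(row) for v in row] for row in k]
```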
Angerer et al have applied the diffusion maps concept to the analysis of single-cell RNA-seq data to create an R package called destiny.
We will take the rank order of cells in the first diffusion map component as “diffusion map pseudotime” here.
Like the other methods, using the first diffusion map component from destiny as pseudotime does a good job at ordering the early time-points (if we take high values as “earlier” in development), but it is unable to distinguish the later ones.
Exercise 2 Do you get a better resolution between the later time points by considering additional eigenvectors?
Exercise 3 How does the ordering change if you only use the genes identified by M3Drop?
The SLICER method is an algorithm for constructing trajectories that describe gene expression changes during a sequential biological process, just as Monocle and TSCAN are. SLICER is designed to capture highly nonlinear gene expression changes, automatically select genes related to the process, and detect multiple branch and loop features in the trajectory (Welch, Hartemink, and Prins 2016) . The SLICER R package is available from its GitHub repository and can be installed from there using the devtools package.
We use the select_genes function in SLICER to automatically select the genes to use in building the cell trajectory. The function uses “neighbourhood variance” to identify genes that vary smoothly, rather than fluctuating randomly, across the set of cells. Following this, we determine which value of “k” (number of nearest neighbours) yields an embedding that most resembles a trajectory. Then we estimate the locally linear embedding of the cells.
With the locally linear embedding computed we can construct a k-nearest neighbour graph that is fully connected. This plot displays a (yellow) circle for each cell, with the cell ID number overlaid in blue. Here we show the graph computed using 10 nearest neighbours. Here, SLICER appears to detect one major trajectory with one branch.
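Building a symmetric k-nearest-neighbour graph like the one plotted above can be sketched as follows. This illustrates the graph construction only; it is not SLICER's conn_knn_graph, and SLICER works in the LLE embedding rather than on raw coordinates.

```python
def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph: each point is linked to its k
    closest neighbours (Euclidean distance), and edges are made mutual."""
    n = len(points)
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    adj = {i: set() for i in range(n)}
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        for j in order[:k]:
            adj[i].add(j)
            adj[j].add(i)  # symmetrise so the graph is undirected
    return adj
```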
From this graph we can identify “extreme” cells that are candidates for start/end cells in the trajectory.
Having defined a start cell we can order the cells in the estimated pseudotime.
We can again compare the inferred pseudotime to the known sampling timepoints. SLICER does not provide a pseudotime value per se, just an ordering of cells.
Like the previous method, SLICER here provides a good ordering for the early time points. It places “16cell” cells before “8cell” cells, but provides better ordering for blast cells than many of the earlier methods.
Exercise 4 How do the results change for different k? (e.g. k = 5) What about changing the number of nearest neighbours in the call to conn_knn_graph ?
Exercise 5 How does the ordering change if you use a different set of genes from those chosen by SLICER (e.g. the genes identified by M3Drop)?
Ouija (http://kieranrcampbell.github.io/ouija/) takes a different approach from the pseudotime estimation methods we have looked at so far. Earlier methods have all been “unsupervised”, which is to say that apart from perhaps selecting informative genes we do not supply the method with any prior information about how we expect certain genes or the trajectory as a whole to behave.
Ouija, in contrast, is a probabilistic framework that allows for interpretable learning of single-cell pseudotimes using only small panels of marker genes. This method:
- infers pseudotimes from a small number of marker genes letting you understand why the pseudotimes have been learned in terms of those genes
- provides parameter estimates (with uncertainty) for interpretable gene regulation behaviour (such as the peak time or the upregulation time)
- has a Bayesian hypothesis test to find genes regulated before others along the trajectory
- identifies metastable states, i.e. discrete cell types along the continuous trajectory.
We will supply the following marker genes to Ouija (with timepoints where they are expected to be highly expressed):
- Early timepoints: Dazl, Rnf17, Sycp3, Nanog, Pou5f1, Fgf8, Egfr, Bmp5, Bmp15
- Mid timepoints: Zscan4b, Foxa1, Prdm14, Sox21
- Late timepoints: Creb3, Gpx4, Krt8, Elf5, Eomes, Cdx2, Tdgf1, Gdf3
With Ouija we can model genes as either exhibiting monotonic up or down regulation (known as switch-like behaviour), or transient behaviour where the gene briefly peaks. By default, Ouija assumes all genes exhibit switch-like behaviour (the authors assure us not to worry if we get it wrong - the noise model means incorrectly specifying a transient gene as switch-like has minimal effect).
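The two behaviours can be written down directly. These mean functions are an illustrative parameterisation of “switch-like” (sigmoid) and “transient” (Gaussian peak) regulation, not Ouija's exact likelihood model.

```python
import math

def switch_mean(t, mu0, k, t0):
    """Switch-like mean: expression rises (k > 0) or falls (k < 0)
    sigmoidally around switch time t0, saturating at 2 * mu0."""
    return 2 * mu0 / (1 + math.exp(-k * (t - t0)))

def transient_mean(t, mu0, lam, p):
    """Transient mean: expression briefly peaks (height 2 * mu0) at
    peak time p, with lam controlling how sharp the peak is."""
    return 2 * mu0 * math.exp(-lam * (t - p) ** 2)
```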
Here we can “cheat” a little and check that our selected marker genes do actually identify different timepoints of the differentiation process.
In order to fit the pseudotimes we simply call ouija, passing in the expected response types. Note that if no response types are provided then they are all assumed to be switch-like by default, which is what we will do here. The input to Ouija can be a cell-by-gene matrix of non-negative expression values, an ExpressionSet object, or, happily, the logcounts values selected from a SingleCellExperiment object.
We can apply prior information about whether genes are up- or down-regulated across the differentiation process, and also provide prior information about when the switch in expression or a peak in expression is likely to occur.
We can fit the Ouija model using either:
- Hamiltonian Monte Carlo (HMC) - full MCMC inference where gradient information of the log-posterior is used to “guide” the random walk through the parameter space, or
- Automatic Differentiation Variational Bayes (ADVI or simply VI) - approximate inference where the KL divergence to an approximate distribution is minimised.
In general, HMC will provide more accurate inference with approximately correct posterior variance for all parameters. However, VI is orders of magnitude quicker than HMC and, while it may underestimate posterior variance, the Ouija authors suggest that anecdotally it often performs as well as HMC for discovering posterior pseudotimes.
To help the Ouija model, we provide it with prior information about the strength of switches for up- and down-regulated genes. By setting switch strength to -10 for down-regulated genes and 10 for up-regulated genes with a prior strength standard deviation of 0.5 we are telling the model that we are confident about the expected behaviour of these genes across the differentiation process.
We can plot the gene expression over pseudotime along with the maximum a posteriori (MAP) estimates of the mean function (the sigmoid or Gaussian transient function) using the plot_expression function.
We can also visualise when in the trajectory gene regulation behaviour occurs, either in the form of the switch time or the peak time (for switch-like or transient genes) using the plot_switch_times and plot_transient_times functions:
Identify metastable states using consistency matrices.
Ouija does quite well in the ordering of the cells here, although it can be sensitive to the choice of marker genes and prior information supplied. How do the results change if you select different marker genes or change the priors?
Ouija identifies four metastable states here, which we might annotate as “zygote/2cell”, “4/8/16 cell”, “blast1” and “blast2”.
A common analysis is to work out the regulation orderings of genes. For example, is gene A upregulated before gene B? Does gene C peak before the downregulation of gene D? Ouija answers these questions in terms of a Bayesian hypothesis test of whether the difference in regulation timing (either switch time or peak time) is significantly different to 0. This is collated using the gene_regulation function.
What conclusions can you draw from the gene regulation output from Ouija?
If you have time, you might try the HMC inference method and see if that changes the Ouija results in any way.
8.4.7 Comparison of the methods
How do the trajectories inferred by TSCAN, Monocle, Diffusion Map, SLICER and Ouija compare?
TSCAN and Diffusion Map methods get the trajectory the “wrong way round”, so we’ll adjust that for these comparisons.
We see here that Ouija, TSCAN and SLICER all give trajectories that are similar and strongly correlated with PC1. Diffusion Map is less strongly correlated with these methods, and Monocle gives very different results.
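One way to quantify this kind of agreement is the Spearman rank correlation between two pseudotime vectors. A small self-contained version (ignoring ties, which is fine for illustration) is:

```python
def spearman(x, y):
    """Spearman rank correlation between two pseudotime vectors:
    Pearson correlation of the ranks (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var
```

A value near +1 means two methods order the cells almost identically; near -1 means one trajectory is the other “the wrong way round”.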
Exercise 6: Compare destiny and SLICER to TSCAN, Monocle and Ouija in more depth. Where and how do they differ?
8.4.8 Expression of genes through time
Each package also enables the visualization of expression through pseudotime. Following individual genes is very helpful for identifying genes that play an important role in the differentiation process. We illustrate the procedure using the Rhoa gene.
We have added the pseudotime values computed with all methods here to the colData slot of an SCE object. Having done that, the full plotting capabilities of the scater package can be used to investigate relationships between gene expression, cell populations and pseudotime. This is particularly useful for the packages such as SLICER that do not provide plotting functions.
How many of these methods outperform the naive approach of using the first principal component to represent pseudotime for these data?
Exercise 7: Repeat the exercise using a subset of the genes, e.g. the set of highly variable genes that can be obtained using Brennecke_getVariableGenes()
We have described DELTA-BLAST, a new tool for detecting distant homologs in a protein database search. The results of our experiments show that DELTA-BLAST detects more homologs and provides better quality alignments than do other programs analyzed in this paper.
DELTA-BLAST’s strategy is distinct from those of CS-BLAST and PSI-BLAST in a number of ways. It uses long, putatively homologous alignments with CDs to build its PSSMs, whereas CS-BLAST uses short (13-residue-wide), not necessarily homologous matches with context library profiles. PSI-BLAST performs a BLASTP search to produce alignments and build a PSSM without requiring a specialized or preprocessed resource, although it needs more time for this task. CDD requires more effort to maintain than does the CS-BLAST library of context profiles. However, CDD is an actively maintained resource that is already heavily used for protein annotation at NCBI.
We are exploring ways to improve DELTA-BLAST’s performance, e.g. by developing better methods for weighting coincident hits to several CDs, and by using more information stored in CDD, such as domain hierarchies and specific hit scores (see  for details). Since selecting appropriate CDs is at the heart of DELTA-BLAST’s performance, we are also exploring improvements to RPS-BLAST, and the use of different CDD search tools. Our initial experiments suggest that GLOBAL  works very well for several SCOP superfamilies that are problematic for RPS-BLAST. We also plan to address the performance decline in iterated DELTA-BLAST by using approaches that maintain the information content in the PSSMs constructed.
9.7: The BLASTP algorithm
† Abbreviations used: BLAST, basic local alignment search tool; MSP, maximal segment pair; bp, base-pair(s).
(a) The maximal segment pair measure
Sequence similarity measures generally can be classified as either global or local. Global similarity algorithms optimize the overall alignment of two sequences, which may include large stretches of low similarity (Needleman & Wunsch, 1970). Local similarity algorithms seek only relatively conserved subsequences, and a single comparison may yield several distinct subsequence alignments; unconserved regions do not contribute to the measure of similarity (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). Local similarity measures are generally preferred for database searches, where cDNAs may be compared with partially sequenced genes, and where distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site.
Many similarity measures, including the one we employ, begin with a matrix of similarity scores for all possible pairs of residues. Identities and conservative replacements have positive scores, while unlikely replacements have negative scores. For amino acid sequence comparisons we generally use the PAM-120 matrix (a variation of that of Dayhoff et al., 1978), while for DNA sequence comparisons we score identities +5 and mismatches −4; other scores are of course possible. A sequence segment is a contiguous stretch of residues of any length, and the similarity score for two aligned segments of the same length is the sum of the similarity values for each pair of aligned residues.
Given these rules, we define a maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP may be of any length. The MSP score, which BLAST heuristically attempts to calculate, provides a measure of local similarity for any pair of sequences. A molecular biologist, however, may be interested in all conserved regions shared by 2 proteins, not only in their highest scoring pair. We therefore define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments (Sellers, 1984). BLAST can seek all locally maximal segment pairs with scores above some cutoff.
Like many other similarity measures, the MSP score for 2 sequences may be computed in time proportional to the product of their lengths using a simple dynamic programming algorithm. An important advantage of the MSP measure is that recent mathematical results allow the statistical significance of MSP scores to be estimated under an appropriate random sequence model (Karlin & Altschul, 1990; Karlin et al., 1990). Furthermore, for any particular scoring matrix (e.g. PAM-120) one can estimate the frequencies of paired residues in maximal segments. This tractability to mathematical analysis is a crucial feature of the BLAST algorithm.
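The dynamic-programming computation mentioned above can be sketched directly. This is a simplified ungapped local-alignment recurrence using the +5/-4 DNA scores from section (a), not the BLAST code itself: the running score along each diagonal resets to zero whenever it would go negative, and the best value ever seen is the MSP score.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Highest-scoring ungapped segment pair (the MSP score) by dynamic
    programming in O(len(a) * len(b)) time and O(len(b)) space."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b):
            s = match if ca == cb else mismatch
            cur.append(max(0, prev[j] + s))  # extend the diagonal or restart
            best = max(best, cur[-1])
        prev = cur
    return best
```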
(b) Rapid approximation of MSP scores
In searching a database of thousands of sequences, generally only a handful, if any, will be homologous to the query sequence. The scientist is therefore interested in identifying only those sequence entries with MSP scores over some cutoff score S. These sequences include those sharing highly significant similarity with the query as well as some sequences with borderline scores. This latter set of sequences may include high scoring random matches as well as sequences distantly related to the query. The biological significance of the high scoring sequences may be inferred almost solely on the basis of the similarity score, while the biological context of the borderline sequences may be helpful in distinguishing biologically interesting relationships.
Recent results (Karlin & Altschul, 1990; Karlin et al., 1990) allow us to estimate the highest MSP score S at which chance similarities are likely to appear. To accelerate database searches, BLAST minimizes the time spent on sequence regions whose similarity with the query has little chance of exceeding this score. Let a word pair be a segment pair of fixed length w. The main strategy of BLAST is to seek only segment pairs that contain a word pair with a score of at least T. Scanning through a sequence, one can determine quickly whether it contains a word of length w that can pair with the query sequence to produce a word pair with a score greater than or equal to the threshold T. Any such hit is extended to determine if it is contained within a segment pair whose score is greater than or equal to S. The lower the threshold T, the greater the chance that a segment pair with a score of at least S will contain a word pair with a score of at least T. A small value for T, however, increases the number of hits and therefore the execution time of the algorithm. Random simulation permits us to select a threshold T that balances these considerations.
In our implementations of this approach, details of the 3 algorithmic steps (namely compiling a list of high-scoring words, scanning the database for hits, and extending hits) vary somewhat depending on whether the database contains proteins or DNA sequences. For proteins, the list consists of all words (w-mers) that score at least T when compared to some word in the query sequence. Thus, a query word may be represented by no words in the list (e.g. for common w-mers using PAM-120 scores) or by many. (One may, of course, insist that every w-mer in the query sequence be included in the word list, irrespective of whether pairing the word with itself yields a score of at least T.) For values of w and T that we have found most useful (see below), there are typically of the order of 50 words in the list for every residue in the query sequence, e.g. 12,500 words for a sequence of length 250. If a little care is taken in programming, the list of words can be generated in time essentially proportional to the length of the list.
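The word-list compilation step can be made concrete with a small sketch: for every w-mer in the query, enumerate all w-mers over the alphabet whose aligned score against it is at least T. In practice this is done more cleverly (branch-and-bound over positions rather than full enumeration); brute force on a toy 4-letter alphabet is shown here only to pin down the definition.

```python
# Sketch: compile the list of all w-mers that score at least T
# against some word of the query. Brute-force enumeration for clarity.

from itertools import product

def word_list(query, w, T, alphabet, score):
    words = set()
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in product(alphabet, repeat=w):
            s = sum(score(q, c) for q, c in zip(qword, cand))
            if s >= T:                 # candidate is a high-scoring neighbor
                words.add("".join(cand))
    return words

# Toy example: match/mismatch scores on a 4-letter alphabet.
match = lambda x, y: 5 if x == y else -4
hits = word_list("ACGT", w=3, T=11, alphabet="ACGT", score=match)
```

With T = 11 and these scores, only exact 3-mers of the query qualify; lowering T would admit one-mismatch neighbors as well, mirroring the trade-off discussed in the text.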
The scanning phase raised a classic algorithmic problem, i.e. search a long sequence for all occurrences of certain short sequences. We investigated 2 approaches. Simplified, the first works as follows: suppose that w = 4, and map each word to an integer between 1 and 20^4, so that a word can be used as an index into an array of size 20^4 = 160,000. Let the i-th entry of such an array point to the list of all occurrences in the query sequence of the i-th word. Thus, as we scan the database, each database word leads us immediately to the corresponding hits. Typically, only a few thousand of the 20^4 possible words will be in this table, and it is easy to modify the approach to use far fewer than 20^4 pointers.
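A minimal version of this first scanning approach, with a dictionary standing in for the integer-indexed array of pointers:

```python
# Sketch: index each w-mer of the query by its positions, then turn
# every database word into hits by a single table lookup.

from collections import defaultdict

def build_index(query, w):
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)   # word -> query positions
    return index

def scan(database, index, w):
    hits = []
    for j in range(len(database) - w + 1):
        for i in index.get(database[j:j + w], ()):
            hits.append((i, j))           # (query position, database position)
    return hits

idx = build_index("ACGTACG", w=4)
hits = scan("TTACGTAA", idx, w=4)
```

Each database position costs one lookup plus one step per hit, which is why the scan runs in time essentially proportional to the database length.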
The second approach we explored for the scanning phase was the use of a deterministic finite automaton or finite state machine (Mealy, 1955; Hopcroft & Ullman, 1979). An important feature of our construction was to signal acceptance on transitions (Mealy paradigm) as opposed to on states (Moore paradigm). In the automaton's construction, this saved a factor in space and time roughly proportional to the size of the underlying alphabet. This method yielded a program that ran faster, and we prefer this approach for general use. With typical query lengths and parameter settings, this version of BLAST scans a protein database at approximately 500,000 residues/s.
Extending a hit to find a locally maximal segment pair containing that hit is straightforward. To economize time, we terminate the process of extending in one direction when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions. This introduces a further departure from the ideal of finding guaranteed MSPs, but the added inaccuracy is negligible, as can be demonstrated by both experiment and analysis (e.g. for protein comparisons the default distance is 20, and the probability of missing a higher-scoring extension is about 0.001).
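The extension with early termination can be sketched as follows for one direction; extension to the left is symmetric. This is an illustration of the drop-off rule in the text, not the BLAST source.

```python
# Sketch: extend a hit to the right, stopping once the running score
# falls more than `drop` below the best score seen so far
# (the default distance of 20 mentioned in the text).

def extend_right(a, b, i, j, score, drop=20):
    best = run = 0
    best_len = k = 0
    while i + k < len(a) and j + k < len(b):
        run += score(a[i + k], b[j + k])
        k += 1
        if run > best:
            best, best_len = run, k
        elif best - run > drop:        # fallen too far below the best: stop
            break
    return best, best_len

match = lambda x, y: 5 if x == y else -4
result = extend_right("ACGTTT", "ACGACG", 0, 0, match)
```

A smaller `drop` terminates sooner at a slightly higher risk of missing a better, longer extension, which is exactly the trade-off the quoted 0.001 miss probability characterizes.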
For DNA, we use a simpler word list, i.e. the list of all contiguous w-mers in the query sequence, often with w = 12. Thus, a query sequence of length n yields a list of n − w + 1 words, and again there are commonly a few thousand words in the list. It is advantageous to compress the database by packing 4 nucleotides into a single byte, using an auxiliary table to delimit the boundaries between adjacent sequences. Assuming w ≥ 11, each hit must contain an 8-mer hit that lies on a byte boundary. This observation allows us to scan the database byte-wise and thereby increase speed 4-fold. For each 8-mer hit, we check for an enclosing w-mer hit; if found, we extend as before. Running on a SUN4, with a query of typical length (e.g. several thousand bases), BLAST scans at approximately 2 × 10^6 bases/s. At facilities which run many such searches a day, loading the compressed database into memory once in a shared memory scheme affords substantial saving in subsequent search times.
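The compression described above (four nucleotides per byte, two bits per base) can be sketched as follows; the auxiliary boundary table and padding of sequences whose length is not a multiple of four are omitted.

```python
# Sketch: pack 4 nucleotides into one byte (2 bits per base) and unpack.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    assert len(seq) % 4 == 0            # real code pads; kept simple here
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for ch in seq[i:i + 4]:
            byte = (byte << 2) | CODE[ch]
        out.append(byte)
    return bytes(out)

def unpack(data):
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):      # high-order base first
            seq.append(BASES[(byte >> shift) & 3])
    return "".join(seq)

packed = pack("ACGTTGCA")
```

Because an aligned w-mer hit with w ≥ 11 must span a full byte-aligned 8-mer, the scan can compare whole bytes, giving the 4-fold speedup mentioned in the text.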
It should be noted that DNA sequences are highly non-random, with locally biased base composition (e.g. A+T-rich regions) and repeated sequence elements (e.g. Alu sequences), and this has important consequences for the design of a DNA database search tool. If a given query sequence has, for example, an A+T-rich subsequence, or a commonly occurring repetitive element, then a database search will produce a copious output of matches with little interest. We have designed a somewhat ad hoc but effective means of dealing with these 2 problems. The program that produces the compressed version of the DNA database tabulates the frequencies of all 8-tuples. Those occurring much more frequently than expected by chance (controllable by parameter) are stored and used to filter “uninformative” words from the query word list. Also, preceding full database searches, a search of a sublibrary of repetitive elements is performed, and the locations in the query of significant matches are stored. Words generated by these regions are removed from the query word list for the full search. Matches to the sublibrary, however, are reported in the final output. These 2 filters allow alignments to regions with biased composition, or to regions containing repetitive elements, to be reported, as long as adjacent regions not containing such features share significant similarity to the query sequence.
The BLAST strategy admits numerous variations. We implemented a version of BLAST that uses dynamic programming to extend hits so as to allow gaps in the resulting alignments. Needless to say, this greatly slows the extension process. While the sensitivity of amino acid searches was improved in some cases, the selectivity was reduced as well. Given the trade-off of speed and selectivity for sensitivity, it is questionable whether the gap version of BLAST constitutes an improvement. We also implemented the alternative of making a table of all occurrences of the w-mers in the database, then scanning the query sequence and processing hits. The disk space requirements are considerable, approximately 2 computer words for every residue in the database. More damaging was that, for query sequences of typical length, the need for random access into the database (as opposed to sequential access) made the approach slower, on the computer systems we used, than scanning the entire database.
To evaluate the utility of our method, we describe theoretical results about the statistical significance of MSP scores, study the accuracy of the algorithm for random sequences at approximating MSP scores, compare the performance of the approximation to the full calculation on a set of related protein sequences and, finally, demonstrate its performance comparing long DNA sequences.
(a) Performance of BLAST with random sequences
Theoretical results on the distribution of MSP scores from the comparison of random sequences have recently become available (Karlin & Altschul, 1990; Karlin et al., 1990). In brief, given a set of probabilities for the occurrence of individual residues, and a set of scores for aligning pairs of residues, the theory provides two parameters λ and K for evaluating the statistical significance of MSP scores. When two random sequences of lengths m and n are compared, the probability of finding a segment pair with a score greater than or equal to S is:

1 − e^(−y),   (1)

where y = Kmn·e^(−λS). More generally, the probability of finding c or more distinct segment pairs, all with a score of at least S, is given by the formula:

1 − e^(−y) · Σ_{i=0}^{c−1} (y^i / i!).   (2)
Using this formula, two sequences that share several distinct regions of similarity can sometimes be detected as significantly related, even when no segment pair is statistically significant in isolation.
While finding an MSP with a p -value of 0.001 may be surprising when two specific sequences are compared, searching a database of 10,000 sequences for similarity to a query sequence is likely to turn up ten such segment pairs simply by chance. Segment pair p -values must be discounted accordingly when the similar segments are discovered through blind database searches. Using formula (1), we can calculate the approximate score an MSP must have to be distinguishable from chance similarities found in a database.
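The discounting described above can be made concrete with a short calculation: the P-value of an MSP score from formula (1), multiplied by the number of database sequences, gives the expected number of chance MSPs in a blind search. K and λ depend on the scoring system; the values below are placeholders for illustration, not the published PAM-120 parameters.

```python
# Sketch: P-value of an MSP score under the Karlin-Altschul model,
# and the expected count of chance MSPs in a database of D sequences.
# K and lam (lambda) are placeholder parameter values, not PAM-120's.

import math

def msp_pvalue(S, m, n, K, lam):
    y = K * m * n * math.exp(-lam * S)
    return 1.0 - math.exp(-y)          # formula (1): P = 1 - e^(-y)

# One pairwise comparison, then a blind search of D = 10,000 sequences:
p = msp_pvalue(S=55, m=250, n=250, K=0.09, lam=0.27)
expected_in_db = 10000 * p             # chance MSPs expected in the search
```

With these placeholder parameters a single-comparison P-value of about 0.002 becomes roughly 20 expected chance matches over the whole database, which is why borderline scores must be discounted.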
We are interested in finding only segment pairs with a score above some cutoff S. The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T. It is therefore of interest to know what proportion of segment pairs with a given score contain such a word pair. This question makes sense only in the context of some distribution of high-scoring segment pairs. For MSPs arising from the comparison of random sequences, Dembo & Karlin (1991) provide such a limiting distribution. Theory does not yet exist to calculate the probability q that such a segment pair will fail to contain a word pair with a score of at least T. However, one argument suggests that q should depend exponentially upon the score of the MSP. Because the frequencies of paired letters in MSPs approach a limiting distribution (Karlin & Altschul, 1990), the expected length of an MSP grows linearly with its score. Therefore, the longer an MSP, the more independent chances it effectively has for containing a word with a score of at least T, implying that q should decrease exponentially with increasing MSP score S.
To test this idea, we generated one million pairs of “random protein sequences” (using typical amino acid frequencies) of length 250, and found the MSP for each using PAM-120 scores. In Figure 1, we plot the logarithm of the fraction q of MSPs with score S that do not contain a word pair of length four with a score of at least 18. Since the values shown are subject to statistical variation, error bars represent one standard deviation. A regression line is plotted, allowing for heteroscedasticity (differing degrees of accuracy of the y-values). The correlation coefficient for −ln(q) and S is 0.999, suggesting that for practical purposes our model of the exponential dependence of q upon S is valid.
We repeated this analysis for a variety of word lengths and associated values of T. Table 1 shows the regression parameters a and b found for each instance; the correlation was always greater than 0.995. Table 1 also shows the implied percentage q = e^(−(aS+b)) of MSPs with various scores that would be missed by the BLAST algorithm. These numbers are of course properly applicable only to chance MSPs. However, using a log-odds score matrix such as the PAM-120 that is based upon empirical studies of homologous proteins, high-scoring chance MSPs should resemble MSPs that reflect true homology (Karlin & Altschul, 1990). Therefore, Table 1 should provide a rough guide to the performance of BLAST on homologous as well as chance MSPs.
Based on the results of Karlin et al. (1990), Table 1 also shows the expected number of MSPs found when searching a random database of 16,000 length-250 protein sequences with a length-250 query. (These numbers were chosen to approximate the current size of the PIR database and the length of an average protein.) As seen from Table 1, only MSPs with a score over 55 are likely to be distinguishable from chance similarities. With w = 4 and T = 17, BLAST should miss only about a fifth of the MSPs with this score, and only about a tenth of MSPs with a score near 70. We will consider below the algorithm's performance on real data.
(b) The choice of word length and threshold parameters
On what basis do we choose the particular setting of the parameters w and T for executing BLAST on real data? We begin by considering the word length w .
The time required to execute BLAST is the sum of the times required (1) to compile a list of words that can score at least T when compared with words from the query; (2) to scan the database for hits (i.e. matches to words on this list); and (3) to extend all hits to seek segment pairs with scores exceeding the cutoff. The time for the last of these tasks is proportional to the number of hits, which clearly depends on the parameters w and T. Given a random protein model and a set of substitution scores, it is simple to calculate the probability that two random words of length w will have a score of at least T, i.e. the probability of a hit arising from an arbitrary pair of words in the query and the database. Using the random model and scores of the previous section, we have calculated these probabilities for a variety of parameter choices and recorded them in Table 1. For a given level of sensitivity (chance of missing an MSP), one can ask what choice of w minimizes the chance of a hit. Examining Table 1, it is apparent that the parameter pairs (w = 3, T = 14), (w = 4, T = 16) and (w = 5, T = 18) all have approximately equivalent sensitivity over the relevant range of cutoff scores. The probability of a hit yielded by these parameter pairs is seen to decrease with increasing w; the same also holds for different levels of sensitivity. This makes intuitive sense, for the longer the word pair examined, the more information gained about potential MSPs. Maintaining a given level of sensitivity, we can therefore decrease the time spent on step (3), above, by increasing the parameter w. However, there are complementary problems created by large w. For proteins there are 20^w possible words of length w, and for a given level of sensitivity the number of words generated by a query grows exponentially with w. (For example, using the 3 parameter pairs above, a 30-residue sequence was found to generate word lists of size 296, 3561 and 40,939 respectively.)
This increases the time spent on step (1), and the amount of memory required. In practice, we have found that for protein searches the best compromise between these considerations is a word size of four; this is the parameter setting we use in all analyses that follow.
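The hit-probability calculation referred to above can be sketched by convolving the score distribution of a single aligned residue pair w times under a random model. A toy 4-letter alphabet with uniform frequencies is used here; the real calculation uses 20 amino acid frequencies and a substitution matrix.

```python
# Sketch: probability that two random words of length w score >= T,
# computed by w-fold convolution of the single-pair score distribution.

from collections import defaultdict

def hit_probability(w, T, freqs, score):
    # distribution of the score of one random aligned residue pair
    single = defaultdict(float)
    for x, px in freqs.items():
        for y, py in freqs.items():
            single[score(x, y)] += px * py
    dist = {0: 1.0}
    for _ in range(w):                 # convolve w times
        nxt = defaultdict(float)
        for s, p in dist.items():
            for ds, dp in single.items():
                nxt[s + ds] += p * dp
        dist = nxt
    return sum(p for s, p in dist.items() if s >= T)

# Toy model: uniform 4-letter frequencies, match +5 / mismatch -4.
freqs = {b: 0.25 for b in "ACGT"}
match = lambda x, y: 5 if x == y else -4
p_hit = hit_probability(w=3, T=6, freqs=freqs, score=match)
```

Raising T or w drives this probability down, which is the effect tabulated in Table 1.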
Although reducing the threshold T improves the approximation of MSP scores by BLAST, it also increases execution time because there will be more words generated by the query sequence and therefore more hits. What value of T provides a reasonable compromise between the considerations of sensitivity and time? To provide numerical data, we compared a random 250-residue sequence against the entire PIR database (Release 23.0; 14,372 entries and 3,977,903 residues) with T ranging from 20 to 13. In Figure 2 we plot the execution time (user time on a SUN4-280) versus the number of words generated for each value of T. Although there is a linear relationship between the number of words generated and execution time, the number of words generated increases exponentially with decreasing T over this range (as seen by the spacing of x values). This plot and a simple analysis reveal that the expected-time computational complexity of BLAST is approximately aW + bN + cNW/20^w, where W is the number of words generated, N is the number of residues in the database, and a, b and c are constants. The W term accounts for compiling the word list, the N term covers the database scan, and the NW term is for extending the hits. Although the number of words generated, W, increases exponentially with decreasing T, it increases only linearly with the length of the query, so that doubling the query length doubles the number of words. We have found in practice that T = 17 is a good choice for the threshold because, as discussed below, lowering the parameter further provides little improvement in the detection of actual homologies.
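The expected-time model above is simple enough to state directly in code; the constants a, b and c below are machine-dependent placeholders, not measured values.

```python
# Sketch of the expected-time model aW + bN + cNW/20^w:
# word-list compilation (aW), database scan (bN), hit extension (cNW/20^w).
# a, b, c are placeholder constants for illustration only.

def blast_cost(W, N, w, a=1e-6, b=1e-8, c=1e-7):
    return a * W + b * N + c * N * W / 20 ** w
```

The model makes the trade-offs in the text explicit: lowering T inflates W (and hence the first and third terms), while increasing w shrinks the extension term's 1/20^w factor at the cost of a larger word list.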
BLAST's direct tradeoff between accuracy and speed is best illustrated by Table 2. Given a specific probability q of missing a chance MSP with score S, one can calculate what threshold parameter T is required, and therefore the approximate execution time. Combining the data of Table 1 and Figure 2, Table 2 shows the central processing unit times required (for various values of q and S) to search the current PIR database with a random query sequence of length 250. To have about a 10% chance of missing an MSP with the statistically significant score of 70 requires about nine seconds of central processing unit time. To reduce the chance of missing such an MSP to 2% involves lowering T, thereby doubling the execution time. Table 2 illustrates, furthermore, that the higher scoring (and more statistically significant) an MSP, the less time is required to find it with a given degree of certainty.
(c) Performance of BLAST with homologous sequences
To study the performance of BLAST on real data, we compared a variety of proteins with other members of their respective superfamilies (Dayhoff, 1978), computing the true MSP scores as well as the BLAST approximation with word length four and various settings of the parameter T. Only with superfamilies containing many distantly related proteins could we obtain results usefully comparable with the random model of the previous section. Searching the globins with woolly monkey myoglobin (PIR code MYMQW), we found 178 sequences containing MSPs with scores between 50 and 80. Using word length four and T parameter 17, the random model suggests BLAST should miss about 24 of these MSPs; in fact, it misses 43. This poorer-than-expected performance is due to the uniform pattern of conservation in the globins, resulting in a relatively small number of high-scoring words between distantly related proteins. A contrary example was provided by comparing the mouse immunoglobulin κ chain precursor V region (PIR code KVMST1) with immunoglobulin sequences, using the same parameters as previously. Of the 33 MSPs with scores between 45 and 65, BLAST missed only two; the random model suggests it should have missed eight. In general, the distribution of mutations along sequences has been shown to be more clustered than predicted by a Poisson process (Uzzell & Corbin, 1971), and thus the BLAST approximation should, on average, perform better on real sequences than predicted by the random model.
BLAST's great utility is for finding high-scoring MSPs quickly. In the examples above, the algorithm found all but one of the 89 globin MSPs with a score over 80, and all of the 125 immunoglobulin MSPs with a score over 50. The overall performance of BLAST depends upon the distribution of MSP scores for those sequences related to the query. In many instances, the bulk of the MSPs that are distinguishable from chance have a high enough score to be found readily by BLAST, even using relatively high values of the T parameter. Table 3 shows the number of MSPs with a score above a given threshold found by BLAST when searching a variety of superfamilies using a variety of T parameters. In each instance, the threshold S is chosen to include scores in the borderline region, which in a full database search would include chance similarities as well as biologically significant relationships. Even with T equal to 18, virtually all the statistically significant MSPs are found in most instances.
Comparing BLAST (with parameters w = 4, T = 17) to the widely used FASTP program (Lipman & Pearson, 1985; Pearson & Lipman, 1988) in its most sensitive mode (ktup = 1), we have found that BLAST is of comparable sensitivity, generally yields fewer false positives (high-scoring but unrelated matches to the query), and is over an order of magnitude faster.
(d) Comparison of two long DNA sequences
Sequence data exist for a 73,360 bp section of the human genome containing the β-like globin gene cluster and for a corresponding 44,595 bp section of the rabbit genome (Margot et al., 1989). The pair exhibits three main classes of locally similar regions, namely genes, long interspersed repeats and certain anticipated weaker similarities, as described below. We used the BLAST algorithm to locate locally similar regions that can be aligned without introduction of gaps.
The human gene cluster contains six globin genes, denoted ε, Gγ, Aγ, η, δ and β, while the rabbit cluster has only four, namely ε, γ, δ and β. (Actually, rabbit δ is a pseudogene.) Each of the 24 gene pairs, one human gene and one rabbit gene, constitutes a similar pair. An alignment of such a pair requires insertions and deletions, since the three exons of one gene generally differ somewhat in their lengths from the corresponding exons of the paired gene, and there are even more extensive variations among the introns. Thus, a collection of the highest-scoring alignments between similar regions can be expected to have at least 24 alignments between gene pairs.
Mammalian genomes contain large numbers of long interspersed repeat sequences, abbreviated LINEs. In particular, the human β-like globin cluster contains two overlapped L1 sequences (a type of LINE) and the rabbit cluster has two tandem L1 sequences in the same orientation, both around 6000 bp in length. These human and rabbit L1 sequences are quite similar, and their lengths make them highly visible in similarity computations. In all, eight L1 sequences have been cited in the human cluster and five in the rabbit cluster, but because of their reduced length and/or reversed orientation, the other published L1 sequences do not affect the results discussed below. Very recently, another piece of an L1 sequence has been discovered in the rabbit cluster (Huang et al., 1990).
Evolution theory suggests that an ancestral gene cluster arranged as 5′-ε-γ-η-δ-β-3′ may have existed before the mammalian radiation. Consistent with this hypothesis, there are inter-gene similarities within the β clusters. For example, there is a region between human ε and Gγ that is similar to a region between rabbit ε and γ.
We applied a variant of the BLAST programs to these two sequences, with match score 5, mismatch score −4 and, initially, w = 12. The program found 98 alignments scoring over 200, with 1301 being the highest score. Of the 57 alignments scoring over 350, 45 paired genes (with each of the 24 possible gene pairs represented) and the remaining 12 involved L1 sequences. Below 350, inter-gene similarities (as described above) appear, along with additional alignments of genes and of L1 sequences. Two alignments with scores between 200 and 350 do not fit the anticipated pattern. One reveals the newly discovered section of L1 sequence. The other aligns a region immediately 5′ of the human β gene with a region just 5′ of rabbit δ. This last alignment may be the result of an intrachromosomal gene conversion between δ and β in the rabbit genome (Hardison & Margot, 1984).
With smaller values of w , more alignments are found. In particular, with w = 8, an additional 32 alignments are found with a score above 200. All of these fall in one of the three classes discussed above. Thus, use of a smaller w provides no essentially new information. The dependence of various values on w is given in Table 4. Time is measured in seconds on a SUN4 for a simple variant of BLAST that works with uncompressed DNA sequences.
Choosing a BLAST program
Of the five BLAST programs available, we primarily use Standard Nucleotide BLAST (blastn), Standard Protein BLAST (blastp), and Translated BLAST (blastx). NCBI has a terrific getting-started guide for BLAST, which includes a simple explanation of the different BLAST programs, databases, and elements of the BLAST search pages.
At Addgene, we use blastn to identify any discrepancies in Sanger sequences, such as mismatches, deletions, or insertions. We use blastp or blastx to compare our sequencing results to protein sequences to check open reading frames (ORFs) and determine the potential effect of any nucleotide discrepancies. The blastp and blastx programs are optimized differently and you may want to select one (or both) depending on the information you want to verify. We will delve into these differences below.
Algorithmic overview of DIAMOND
DIAMOND uses the double-indexing approach, in conjunction with multiple spaced seeds 17 , to optimize the handling of large query and large reference databases. In the first step, tables of seed–location pairs are built for query and reference sequences. Next, matching seeds are computed using a hash join technique that conducts recursive radix clustering of both tables until a hash table for the query data fits into the cache, at which point the rest of the join is computed by hashing 18 . We found this approach to be faster than the sorting method used by older versions of DIAMOND 16 , especially given that a full sorting of the reference table is avoided for smaller query datasets.
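The hash-join step can be sketched as follows: one table of (seed, location) pairs is hashed and the other is used to probe it. The recursive radix clustering that DIAMOND uses to keep each partition cache-resident is omitted here; this is an illustration of the join itself, not DIAMOND's implementation.

```python
# Sketch: join query and reference (seed, location) tables on the seed.
# Build a hash table over the query side, probe with the reference side.

from collections import defaultdict

def hash_join(query_seeds, ref_seeds):
    table = defaultdict(list)
    for seed, qloc in query_seeds:     # build phase on the query table
        table[seed].append(qloc)
    matches = []
    for seed, rloc in ref_seeds:       # probe phase on the reference table
        for qloc in table.get(seed, ()):
            matches.append((seed, qloc, rloc))
    return matches

q = [("ACDE", 0), ("CDEF", 1)]
r = [("CDEF", 7), ("GHIK", 9)]
matches = hash_join(q, r)              # -> [('CDEF', 1, 7)]
```

Note the asymmetry the text exploits: only the (typically smaller) query side needs to fit in the hash table, so a full sort of the reference table is avoided.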
The double-indexing algorithm is designed to be cache aware, given that the data associated with one seed need to be loaded for comparison from memory only once, while the classical index-based linear seed lookup suffers from poor data locality. Additionally, our on-the-fly indexing method enables efficient use of multiple spaced seeds by processing the shapes one at a time and not requiring the index tables for all shapes to be present in memory simultaneously, while also avoiding expensive seed lookups through our cache-friendly hash join implementation.
DIAMOND (v2.0.7) uses two seed shapes of weight 10 for its fast mode, 16 shapes of weight 8 and 14 shapes of weight 7 for its sensitive and very-sensitive modes, respectively, and 64 shapes of weight 7 for its ultra-sensitive mode. The seed shapes were computed using SpEED 19 . Even when operating with 64 shapes, the run time generation of the indices, together with the join computation, take up less than 1% of the total run time of the program. When processing the NCBI nr database, the total size of these indices would be 123 billion letters × 9 bytes per entry × 64 shapes, which is
64 TB if kept in memory or written to disk, while DIAMOND (v2.0.7) requires less than 16 GB of RAM when running in ultra-sensitive mode. This shows that DIAMOND does not require expensive computing infrastructures and can be operated with modest hardware resources if needed. Because of the runtime indexing, DIAMOND maintains disk-based database files that contain only the reference sequences, and can optionally also use BLAST databases (since v2.0.8).
Hamming distance filter
In the first stage of the sequence comparison process, a Hamming distance computation between a query sequence and a subject sequence is performed at all seed hit locations, in a 48-letter window encompassing the hit. We optimized this procedure using a chain of SSE (streaming single-instruction multiple-data (SIMD) extensions) pcmpeqb, pmovmskb and popcnt instructions to achieve a tenfold decrease in computation time compared with an ungapped alignment incorporating a scoring matrix, while reducing the number of hits by 1–2 orders of magnitude. A sensitivity-level-dependent cut-off for the Hamming distance, which can also be set manually by the user, determines whether a hit is passed to the next filter stage.
We further extend our initial approach, introduced in the original version of DIAMOND 16 , and maximize the filtering throughput by using a loop-tiling strategy to incorporate the cache hierarchy and address the fact that the data associated with a single seed may exceed the cache capacity in the new very-sensitive and ultra-sensitive modes of DIAMOND (v2.0.7). We also load the 48-letter windows at the query and subject locations into linear buffers prior to running the all-versus-all Hamming distance computation, to make best use of the hardware prefetcher and to avoid any random memory access.
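Logically, the filter of the preceding two paragraphs reduces to counting identical letters in a fixed window and comparing against a cut-off. The plain-Python sketch below shows that logic; DIAMOND's pcmpeqb/pmovmskb/popcnt chain performs the same comparison 16 bytes at a time in SIMD registers.

```python
# Sketch: the Hamming-distance filter in scalar form. Count identities
# in a fixed-length window around a seed hit; keep the hit only if the
# count clears the (sensitivity-dependent) cut-off.

def hamming_filter(query_win, subject_win, min_identities):
    identities = sum(x == y for x, y in zip(query_win, subject_win))
    return identities >= min_identities

# Toy 30-letter windows (real windows are 48 letters).
q = "MKLVINGKTLKGEITVEGAKNAALPILFAT"
s = "MKLVINGQTLKGSITVEGAKNVALPILFAT"
ok = hamming_filter(q, s, min_identities=25)
```

Because identity counting needs no scoring matrix lookups, it is far cheaper than an ungapped alignment, which is the source of the tenfold speedup quoted above.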
After the hamming distance stage, the next step in the pipeline computes ungapped extensions at the seed hit locations. This procedure is vectorized using AVX2 instructions, aligning one query against up to 32 subject sequences. After 32 subject sequences are loaded into AVX2 registers, a 32 × 32 byte matrix transposition is computed using a series of 160 unpack instructions, such that 32 letters of different subjects are interleaved into one SIMD register, and the match scores can be loaded along the query. A sensitivity-level-dependent e-value threshold determines the hits that will be passed to the next stage.
Leftmost seed filter
Due to its double-indexing algorithm, DIAMOND may find the same alignment multiple times independently during the search stage. These redundant hits need to be filtered out to avoid an excessive use of temporary disk space. DIAMOND accomplishes this task by inspecting the local ungapped alignment for seed hits to the left of the hit that is currently being processed, as well as seed hits by previously processed shapes. If such a hit is found, DIAMOND notices the repetition and the current hit is discarded. Given that this procedure entails checking against up to 64 different seed shapes, we further optimized this process by incorporating a precomputed lookup table that stores information on whether any of the processed shapes will hit a given bit-encoded match or mismatch pattern, thus enabling the same check to be performed in one pass over the local hit pattern.
Given that the typical application of an aligner will require the reporting of a certain number of best alignments (hits) for each query (as set on the command line using the --max-target-seqs option), DIAMOND makes use of this parameter to control the computational effort spent on seed extension and avoid having to compute gapped extensions for all seed hits. To this end, after the seed search within target sequences has been concluded, we determine a tentative order of target hits with respect to a single query. In the present case, this ranking procedure uses the ungapped extension scores at seed hits to assign a linear order to the targets. DIAMOND sorts the target list by ungapped extension score (from best to worst) for each target, similar to the way in which MMseqs2 uses its ungapped extension-derived prefilter scores. Whereas MMseqs2 will then compute Smith–Waterman extensions for a fixed number of best targets (as set using the --max-seqs parameter), DIAMOND uses a dynamic criterion to halt evaluation of further targets. We refer to this dynamic approach as adaptive ranking, which improves the DIAMOND reporting accuracy compared with the static criterion used by MMseqs2, while providing a less biased and more data-adapted filtering procedure. The ranked list is processed in chunks of 400 targets (configurable on the command line using --ext-chunk-size), for which extensions are computed. If no extension in the current chunk yields a significant alignment under the user-specified reporting criteria, computation of further extensions for the query is aborted; otherwise the next chunk of targets will be processed.
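The adaptive-ranking loop described above can be sketched as follows. The `gapped_extend` and `is_significant` callables are stand-ins for DIAMOND's extension routine and reporting criteria, not its actual interfaces.

```python
# Sketch: adaptive ranking. Sort targets by ungapped extension score,
# compute gapped extensions chunk by chunk, and stop once an entire
# chunk yields no significant alignment.

def adaptive_rank_extend(targets, gapped_extend, is_significant,
                         chunk_size=400):
    ranked = sorted(targets, key=lambda t: t["ungapped_score"], reverse=True)
    results = []
    for start in range(0, len(ranked), chunk_size):
        chunk = ranked[start:start + chunk_size]
        alignments = [gapped_extend(t) for t in chunk]
        hits = [a for a in alignments if is_significant(a)]
        results.extend(hits)
        if not hits:                   # whole chunk insignificant: stop early
            break
    return results
```

Unlike a fixed --max-seqs cut-off, the stopping point here adapts to the data: queries with many good targets get more extension work, and queries with none stop after the first chunk.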
Gapped extension filter
Given that computing full Smith–Waterman 20 extensions is expensive, we have developed a fast heuristic algorithm designed to estimate a gapped alignment score and discard hits that most probably do not meet the user-set reporting threshold. We use a query profile data structure in the same way as the vectorized Smith–Waterman algorithm introduced by Farrar 21 , which is an array for each of the amino acid letters that stores the scores along the query against the given residue. We then use AVX2 instructions to sum up these scores along diagonals of the dynamic programming matrix, thus computing local ungapped extension scores on diagonals. This approach ignores gaps in the alignment and therefore eliminates intra-register data dependencies. With its minimal logic, our heuristic achieves a throughput approximately fivefold faster than a Smith–Waterman computation using the vectorized SWIPE method 22 . Nevertheless, ungapped scores on the diagonals can be used to estimate a gapped extension score by thresholding and computing a one-dimensional dynamic program that disregards the location of the diagonal segments. Although this simplifying assumption leads to an overestimation of the true alignment score most of the time, the heuristic is still able to reduce the number of spurious hits by one order of magnitude in the most sensitive alignment mode. If required by the user, this filter step can be disabled using the option --gapped-filter-evalue 0.
Chaining
Chaining is the computation of a dynamic program at the level of diagonal segments instead of at the base or residue level, and has been used successfully in DNA alignment tools such as minimap2 (ref. 23). DIAMOND (v2.0.7) introduces the use of chaining on protein sequences. The result of the chaining computation is used to infer a scaffold for the optimal alignment and to determine the band geometry for a banded Smith–Waterman algorithm 20.
Chaining can be simplified on DNA sequences by considering only diagonal segments of exact matches. However, this is not possible for protein sequences, which makes the computation substantially more elaborate. DIAMOND solves this problem by sorting the diagonal segments obtained from the ungapped extension stage by their starting position in the subject, and constructing a graph in which nodes represent diagonal segments and edges denote diagonal shifts (gaps), with the pairwise connections between diagonal segments computed in one left-to-right pass. These pairwise connections are then stored as graph edges, incorporating their inbound and outbound coordinates to prevent invalid chains and to allow zigzag connections in which the optimal path repeatedly shifts between the same two diagonal nodes. A red–black tree over the nodes, ordered on the diagonal, is used to quickly access the most proximal nodes as candidates for a connection. For each node, the best score of a local alignment ending in that node is stored; the maximum of these yields the final score estimate and the end point for backtracing the approximate optimal alignment.
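A quadratic-time sketch of chaining over diagonal segments may clarify the computation. The left-to-right pass with a red–black tree described above is more involved; the segment tuples and gap penalties here are illustrative.

```python
def chain(segments, gap_open=11, gap_extend=1):
    """Best chain score over diagonal segments (qstart, sstart, length, score).
    Segment j may follow segment i if i ends before j starts in both
    sequences; a diagonal shift of k residues costs gap_open + k * gap_extend."""
    segs = sorted(segments, key=lambda s: s[1])        # sort on subject start
    best = [s[3] for s in segs]                        # best chain ending at node
    for j, (qj, sj, lj, scj) in enumerate(segs):
        for i in range(j):
            qi, si, li, sci = segs[i]
            if qi + li <= qj and si + li <= sj:        # i precedes j
                shift = abs((qj - sj) - (qi - si))     # diagonal shift = gap size
                pen = 0 if shift == 0 else gap_open + shift * gap_extend
                best[j] = max(best[j], best[i] + scj - pen)
    return max(best, default=0)
```

The argmax over `best` would give the end point for backtracing the scaffold of the approximate optimal alignment, from which the band geometry can be derived.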
The final extensions are computed using a modified version of the vectorized SWIPE (ref. 22) approach that accommodates banding. By design, neither the SWIPE nor the ‘striped’ SIMD vectorization 21 algorithm easily allows banded alignment, requiring a computation proportional to the product of the query and subject sequence lengths. We vectorize the alignment of a query against up to 32 subjects by overlaying the banded dynamic programming matrix columns of the subjects based on their query ranges (the query coordinate interval [i0,i1] that corresponds to a slice of the given column within the subject's band). Given that the bands of the subjects differ, this cannot be fitted perfectly into the register, but it reaches a register load efficiency of 80–90% for larger databases. All extensions are computed using 8-bit scores and are repeated when an overflow is detected, unless an alignment score of >255 is already known from previous stages.
Alignments are scored using the BLOSUM62 matrix by default. In addition, we also use a method of composition-based score adjustments 15 that is designed to increase the specificity of the scoring procedure. If required, DIAMOND (since v2.0.6) also supports applying the BLAST compositional matrix adjust scoring procedure 24 to compute BLAST-like alignment scores (options --comp-based-stats 3,4 ).
As an alternative, DIAMOND (v2.0.7) also includes the option to compute full-matrix instead of banded Smith–Waterman extensions (command line option --ext full), which are also vectorized using the SWIPE algorithm.
Frameshift alignments
Reads produced by MinION technology 25 are known to be noisy and to contain frequent indel errors, a problem that also translates to assemblies derived from such long reads. In consequence, genes cannot be detected reliably on such DNA sequences. DIAMOND addresses this issue by providing frameshift alignments in translated search (blastx) mode. The protein sequences corresponding to all three reading frames of a strand are aligned simultaneously against the target sequence, allowing shifts in the reading frame at any position in the alignment while incurring a user-defined score penalty (set using -F on the command line). Raw MinION reads and contigs up to the length of full bacterial chromosomes are supported as input in translated search mode, enabling gene discovery and annotation in the absence of known gene boundaries.
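The three forward reading frames that the frameshift-aware alignment switches between can be illustrated as follows. The codon table here is a toy subset covering only the codons in the example; unknown codons map to 'X'.

```python
# Toy codon table (illustrative subset, not the full genetic code).
CODON = {"ATG": "M", "GCT": "A", "GAA": "E", "TTT": "F", "TAA": "*"}

def three_frames(read):
    """Translate all three forward reading frames of a read; a frameshift
    alignment may switch between these frames mid-alignment, paying the
    user-defined penalty (-F) for each shift."""
    frames = []
    for f in range(3):
        aa = [CODON.get(read[k:k + 3], "X")      # 'X' marks an unknown codon
              for k in range(f, len(read) - 2, 3)]
        frames.append("".join(aa))
    return frames
```

An indel in the read moves the true coding sequence from one of these frames to another, which is why aligning all three simultaneously, with cheap transitions between them, recovers genes despite sequencing errors.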
Repeat masking
Differentiating between true evolutionary relationships and spurious similarities presents a major challenge in remote homology detection, particularly given the repetitive nature of sequence regions found in many genomes. With an ever-increasing number of genomes available for tree-of-life scale sequence searches, the ability to differentiate between similarity relationships based on sequence repetitiveness and homology based on a biologically meaningful sequence structure (non-repetitive sequence under purifying selection) becomes crucial to reduce the number of false-positive hits and increase alignment specificity at scale. Masking of low-complexity regions (repeat masking) is the most commonly used strategy to eliminate false-positive hits and to retain only hits found in biologically meaningful homologs. It has been shown that, even when the SegMasker tool included in BLASTP 26 is used, many more and stronger spurious similarities arise than would be expected on random sequences under a given e-value threshold 27. DIAMOND reduces this false-positive bias by using more stringent and more sophisticated masking paradigms based on tantan. If required, the tantan masking can be replaced by the more conservative default BLASTP SEG masking and composition-based statistics using the option --comp-based-stats 3 (ref. 24).
Distributed-memory parallelization
As part of DIAMOND, our comprehensive sequence search framework supports a distributed-memory parallelization to leverage the computing power of state-of-the-art HPC and cloud-computing resources for massive-scale protein alignments. To this end, both the query database and the reference database are segmented into data packages that we refer to as chunks. The Cartesian product of both query and reference sets defines a (typically large) set of work packages. In the first instance, files containing metadata on these work packages are created centrally before a parallel run is started on independent computing nodes; they are subsequently processed in a distributed manner by multiple worker processes of DIAMOND. Usually, only one worker process runs per compute node, efficiently utilizing all of the locally available cores via threads. Unlike related work such as mpiBLAST 28, our implementation does not use any special interprocess communication libraries, such as the message passing interface (MPI) specific to HPC environments; instead, it relies on input–output operations supported by any POSIX-compliant parallel file system that is mounted on all of the compute nodes involved. The advantage of this approach is that work packages are distributed in a self-organized way at run time to all participating worker processes using simple file-based stacks located in the parallel file system, with atomic push and pop operations. Once all database chunks for a specific query chunk have been processed, the final worker process involved in the query chunk takes on the role of performing the join operation to ultimately create the output stream. Note that the largest part of the temporary files stays local to a compute node, and only the lightweight work-stack files and the DIAMOND hits from the protein searches are written into the shared parallel file system. This strategy significantly reduces input–output overloads and enables massively parallel processing of DIAMOND runs.
In addition to the lack of complex dependencies, such as on MPI, we highlight the particular advantages of our approach. First, there is no designated primary worker to induce a bottleneck due to synchronization, or to act as a potential single point of failure. Second, and by design, worker processes may join and leave at run time, which is less important on classical HPC systems that use batch systems to orchestrate potentially large numbers of processes, but is of striking advantage on elastic cloud-computing resources and on existing commodity resources such as networked laboratory desktop computers. Last, our transactional file-based work-distribution protocol enables fault tolerance, which means that if worker processes die unexpectedly, other processes in a subsequent run can take on and resume their work packages.
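The file-based work distribution can be sketched as follows. This is a minimal model of the idea, not DIAMOND's on-disk format; it relies on the POSIX guarantee that a rename within one file system is atomic, so exactly one worker wins each claim, and claimed-but-unfinished packages remain on disk for a subsequent run to resume.

```python
import os

class FileStack:
    """Work-package stack on a shared POSIX file system (illustrative).
    Each package is one file; os.rename is atomic on POSIX, so concurrent
    workers can never pop the same package twice."""

    def __init__(self, stack_dir, claim_dir):
        self.stack_dir, self.claim_dir = stack_dir, claim_dir
        os.makedirs(stack_dir, exist_ok=True)
        os.makedirs(claim_dir, exist_ok=True)

    def push(self, name, payload):
        # Write to a hidden temp file, then publish atomically via rename.
        tmp = os.path.join(self.stack_dir, "." + name)
        with open(tmp, "w") as fh:
            fh.write(payload)
        os.rename(tmp, os.path.join(self.stack_dir, name))

    def pop(self):
        for name in sorted(os.listdir(self.stack_dir)):
            if name.startswith("."):
                continue                   # skip unpublished temp files
            src = os.path.join(self.stack_dir, name)
            dst = os.path.join(self.claim_dir, name)
            try:
                os.rename(src, dst)        # only one worker wins this rename
            except OSError:
                continue                   # another worker claimed it first
            with open(dst) as fh:
                return name, fh.read()
        return None                        # stack is empty
```

Because the protocol is just files plus atomic renames, workers can join and leave at run time and no coordinating primary process is needed, matching the fault-tolerance properties described above.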
Benchmark
To create a benchmark database, we annotated the 14 September 2019 release of UniRef50, containing 37.5 million sequences, with SCOPe families. To categorize each protein sequence, we ran SWIPE 22 using an e-value cut-off of 10^−5 against the SCOPe ASTRAL40 v2.07 dataset 12 of domain sequences, which comprises 4,850 protein families; this resulted in a collection of 7.74 million annotated protein sequences. We used the hit with the highest bit score per SCOPe fold (a grouping of structurally similar superfamilies) to infer the protein family annotation, while allowing multidomain associations.
Given that DIAMOND requires a large query dataset to reach its maximum efficiency, we used an analogous SWIPE approach and annotated the NCBI nr database from 25 October 2019 in accordance with SCOPe families. We used UPGMA clustering 29 on the sets of all protein sequences annotated with the same superfamily to cluster and reduce them to a maximum of 1,000 sequences, which we selected as representatives of that superfamily, resulting in a benchmark dataset of 1.71 million queries.
Both query and reference sequences were locally shuffled in 40-letter windows outside the annotated ranges. All benchmark datasets and annotations have been published 30 .
Alignment for all tools was run on an AMD Ryzen Threadripper 2970WX 24-core workstation clocked at 3.0 GHz with 256 GB of RAM, except for the BLASTP (v2.10.0) run, which, because its run time is prohibitive on a desktop workstation, was performed on the Max Planck Society's Draco supercomputer at Garching, Germany, using 24 nodes (32 cores on two Intel Haswell E5-2698v3 chips per node). On the benchmark machine, the performance of BLASTP (v2.10.0) was estimated using a random subset of 10,000 queries sampled from the initial benchmark dataset.
For each query, we determined the AUC1 value, defined as the number of alignments against sequences matching the query's protein family, divided by the total number of database sequences of that family (also called the coverage of the protein family). Only hits up to the first alignment against a false positive were taken into account, a false positive being defined as an alignment between query and subject sequences from different SCOPe folds. For multidomain proteins, the AUC1 value was averaged over the domains. The AUC1 values of the individual queries were again averaged over the query dataset to obtain the final sensitivity value (Fig. 1a). To ensure that a false positive is contained in the result list of every query, the tools were configured to report all alignments up to an e-value of 1,000 (Supplementary Information). Further information about the benchmark design can also be found in the Nature Research Reporting Summary.
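The per-query AUC1 computation described above can be expressed directly; the hit tuples and fold/family labels are illustrative, and the hit list is assumed to be sorted by decreasing score.

```python
def auc1(ranked_hits, query_fold, query_family, family_size):
    """AUC1 for one query: true positives (same SCOPe family) counted until
    the first false positive (a different SCOPe fold), divided by the total
    number of database sequences of that family."""
    tp = 0
    for subject_fold, subject_family in ranked_hits:   # best score first
        if subject_fold != query_fold:                 # first false positive
            break
        if subject_family == query_family:
            tp += 1
    return tp / family_size
```

Hits from the same fold but a different family neither count as true positives nor terminate the scan, which matches the fold-level definition of a false positive used here.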
Detailed assessment of sequence identities in true-positive alignments
We explored the sensitivity of all compared tools in more detail by resolving it at the level of amino acid sequence identity of true-positive alignments. For this purpose, we define the sequence identity of a query–subject association induced by annotation with the same SCOPe protein family as that obtained from the Needleman–Wunsch alignment between the pair of annotated ranges in the query and subject. Extended Data Figure 2 shows a breakdown of the AUC1 sensitivity for our main benchmark, computed as if the search space of positive cases were restricted to associations of the respective sequence identity ranges. Additionally, Extended Data Fig. 3 shows how a query sequence’s family associations are distributed across the identity bins for our benchmark dataset.
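A minimal Needleman–Wunsch identity computation in the spirit of this definition, using toy linear gap and match/mismatch scores rather than BLOSUM62 with affine gaps:

```python
def nw_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment of two annotated ranges; returns
    the fraction of aligned columns with identical residues."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Backtrace, counting identities over all aligned columns.
    i, j, ident, cols = m, n, 0, 0
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                S[i][j] == S[i - 1][j - 1]
                + (match if a[i - 1] == b[j - 1] else mismatch)):
            ident += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        cols += 1
    return ident / cols
```

Binning each query–subject association by this identity value gives the breakdown shown in Extended Data Figs. 2 and 3.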
We report benchmark results for two additional datasets, consisting of sequencing reads from Illumina HiSeq 4000 paired-end sequencing (2 × 150 base pairs) and Illumina HiSeq 2500 paired-end sequencing (2 × 250 base pairs). The datasets were created based on data from a recent rumen metagenome study 31 (Supplementary Information, see Supplementary Benchmark 1) and an environmental study of the topsoil microbiome 32 (Supplementary Information, see Supplementary Benchmark 2). SCOPe-annotated datasets of 1.55 million and 1 million reads, respectively, were obtained as described in the Supplementary Information. The benchmark runs for the two query read datasets were carried out analogously to the run for our main benchmark, operating all tools in translated search mode against the same database of SCOPe-annotated UniRef50 sequences. We report performance, AUC1 values and ROC curves for both runs (Extended Data Figs. 4–7).
Tree-of-life scale protein searches
The ultimate ambition of DIAMOND v2.0.7 is to provide a comprehensive search framework for sensitive tree-of-life scale protein alignments in the Earth BioGenome Project era and beyond. Although BLAST-like sensitivity levels are the maximally achievable thresholds for pairwise alignments, the next focus of any aligner should be the computational scalability to process millions of sequenced species. With the new --ultra-sensitive mode introduced in DIAMOND v2.0.0 we achieve this critical BLAST-like sensitivity level while maintaining an 80-fold computational speedup, and we achieve an additional near-linear parallel speedup when using the custom DIAMOND HPC implementation. To simulate all facets of a tree-of-life scale protein search that is able to mimic future applications of large-scale comparative genomics projects, we performed DIAMOND --very-sensitive and --ultra-sensitive searches on 520 nodes of the Cobra supercomputer of the Max Planck Society (40 cores on two Intel Skylake 6148 chips, and 192 GB RAM per node), totaling 20,800 computing cores (41,600 threads), using the NCBI nr database (currently storing all sequenced proteins for 12,000 eukaryotic species and all proteins from 440,000 genomes of non-eukaryotic species) as the query database, and UniRef50 as the reference dataset. We randomly shuffled the sequences in both FASTA files to avoid a load imbalance due to a biased distribution of sequences in the original files. As a result, DIAMOND v2.0.0 produced 23.1 billion pairwise alignments in the --ultra-sensitive case and 23.0 billion pairwise alignments in the --very-sensitive case, starting from an initial query dataset that contained 281 million sequences and a reference dataset that contained 39 million subject sequences. In --very-sensitive mode the run terminated in 5.42 hours, while in --ultra-sensitive mode it terminated in 17.77 hours. The latter run is shown in Fig. 2 and Extended Data Fig. 1, demonstrating the massive parallelism achieved on the HPC infrastructure, as shown by the processing of individual tasks over time. Due to the parallel nature of the align and join operations, the parallel speedup is virtually linear and is limited only by the throughput of the shared parallel file system of the supercomputer used. This demonstrates that DIAMOND v2.0.0 can harness its algorithmic improvements and its new HPC support to cover all sequenced species in the tree of life within hours rather than months, while matching the alignment sensitivity levels of BLAST. The uncompressed output generated by this run occupies 1,100 GB of disk space and stores the 100 best protein hits for each sequence in the NCBI nr database.
We envision that in the future this type of DIAMOND output will be easily accessible to all life scientists via a web application in which users can filter and search for their protein homologs of interest within minutes across the tree of life on a precomputed dataset, instead of having to perform complex data analytics and months’ or years’ worth of BLAST searches to obtain sensitive protein alignments at this scale.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.