I have a list of 100+ proteins and I need to conduct psi-blast for each one of them against "bacterial proteins" only.
I went to NCBI's protein blast tool, but couldn't figure out how to select/limit the targeted database.
Here are the 2 configs I tried:
Can anyone tell me if either of the approaches is correct? If not, please point the right way.
Edit 1 - add more details
All of my data is from "Supplementary Table 10" of paper "Comparative genomics of the neglected human malaria parasite Plasmodium vivax".
About 5% of all my protein sequences are ISS predicted. Remaining sequences are IEA predicted
Your approach is correct but it is worth considering which database will work best for you. Questions to consider are;
- Does the database provide a large search space for the organism(s) you're interested in?
- Do its IDs work well for searching other databases?
- How does the host database handle identifiers over time?
For protein sequences I would use;
Organism bacteria (taxid:2) (or a subset)
The reasons for this are that I am comfortable with how Uniprot handles their ID's, they provide a large search space with includes clear delineation between reliable (reviewed) and other (unreviewed) sequences, and finally UniprotKB IDs map well to genes. The latter can be problematic. I also know that I won't have trouble searching PDB with UniprotKB IDs.
Despite using and liking the Protein data bank (PDB) I wouldn't use them for this kind of search as they are a structural database which reduces your search space. PDB has 8292 structures for E.coli compared to 23 017 reviewed and 1,335,860 unreviewed sequences in Uniprot. If you are only interested in structures then the PDB is an ideal fit.
I haven't used landmark.
For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect or may tend to misalign sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating from these extremely large MSAs. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease–endonuclease–phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning, big data analyses. MAPGAPS, auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
Certain protein sequence analysis methods require as input a large number of multiply aligned sequences. This includes, for instance, direct coupling analysis (DCA) ( 1–7), which uses statistical modeling to predict 3D interacting residue pairs. DCA typically requires at least a thousand and, ideally, tens of thousands of aligned sequences. Larger multiple sequence alignments (MSAs) are required for deep neural networks and for other machine learning algorithms, such as Bayesian partitioning with pattern selection (BPPS) ( 8). BPPS, like DCA, identifies statistical correlations in an MSA but, unlike DCA, focuses on sets of subgroup-specific co-conserved residues associated with functional specialization rather than on pairwise correlations. For BPPS, at least tens of thousands of aligned sequences are required to obtain accurate subgroup models, each of which might be based on at least a hundred sequences. This process defines sub-MSAs that are useful for performing DCA on superfamily subgroups ( 9).
Of course, to best distinguish signal from noise, sequences also need to be aligned accurately. In this regard, automated MSA programs are far from optimal ( 10–14) due to the unpredictable nature of protein evolution. Consider, for example, a conserved arginine residue that is just beyond a guanine-binding NK.D motif in Ran, Rab, Ras and Rho GTPases and that forms a salt bridge with a conserved acidic residue ( 15). Within some of these GTPases, a deletion (of up to three residues) directly precedes this arginine and an insertion (of up to 15 residues) directly follows it ( 16)—making it more or less impossible to align correctly without manual curation. Moreover, distinct subgroups within a large superfamily typically harbor insertions and deletions relative to other subgroups thereby confounding MSA methods. For these reasons, the National Center for Biotechnology Information manually curates hierarchical MSAs (hiMSAs) for hundreds of widely distributed and diverse protein domain superfamilies. RPS-BLAST ( 17) searches against this conserved domain database (CDD) ( 18) identify domains within a query sequence and the subgroup to which each domain belongs.
Curated hiMSAs have other useful applications as well, including serving as benchmark sets ( 14). Another application, which we highlight here, is as queries for the MAPGAPS (Multiply-Aligned Profiles for Global Alignment of Protein Sequences) program ( 16) to thereby create extremely large and accurate MSAs. Here we announce the public availability of (i) a set of CDD hiMSAs, (ii) a suite of programs for creating from each hiMSA an extremely large MSA and (iii) a set of MSAs obtained in this way. The software package contains MAPGAPS (v2.1) and utilities for format conversion, taxonomic annotation and the removal of undesirable sequences from the output MSA. Phylum annotation and species identifiers are particularly useful for both DCA and BPPS. For DCA, species identifiers facilitate prediction of residue 3D contacts across a protein–protein interface by identifying, for each species, pairs of interacting proteins. For BPPS, this facilitates both co-analysis of interacting proteins and characterization of subgroup phylogenetic diversity to confirm that residue patterns are due to persistent evolutionary constraints rather than merely due to recent common descent. Of course, having an extremely large and accurate MSA also provides additional statistical power for DCA and BPPS and is necessary for machine learning based on deep neural networks.
Availability of CD hiMSAs and of corresponding MSAs
Version 3.17 of the CDD provides over 57 000 position-specific scoring matrices and MSAs corresponding to ancient conserved domains (CDs) for individual protein subgroups that span diverse organisms. Subgroup alignments are kept consistent throughout each superfamily hierarchy, which allows each of these to be mapped onto its typically less extensive `parent’ alignment. CDD subgroups correspond to major branches within sequence trees. CDD alignments are manually curated based on both sequence homology and structural superposition. Superfamily members share a common structural core but often perform diverse biochemical or cellular functions. Each sequence assigned to a given subgroup is also assigned (either explicitly or implicitly) to its parent subgroups. For each subgroup, a consensus sequence is defined based on the most frequent residue per column after weighting for sequence redundancy.
Here we announce the availability (at ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/hiMSA) of 252 curated CDD hiMSAs and of MSAs derived from these. These hiMSAs consist of at least 10 nodes, and their derived MSAs contain at least 1000 sequences after removing sequence fragments matching <75% of the aligned columns and reducing sequence identities to ≥98% the average length of these MSAs is 157 columns and contain an average of 50 nodes and 106 109 sequences. Each superfamily hiMSA comes as a set of text files in mFASTA format along with template files that define how to globally align each subgroup MSA against the other subgroup MSAs from the same superfamily. Each superfamily’s file is in a subdirectory named after its identifier (e.g. cd08372). The numbers of sequences in the corresponding MSAs are given in Table S1 . Although existing methods may align a small number of sequences with comparable accuracy, aligning the numbers of sequences obtained here overwhelms these programs. This includes two recent programs designed to align larger numbers of sequences ( 19, 20) these failed to align most of the sequence sets in Table S1 using the maximum amount of RAM available on our systems.
Using hiMSAs to obtain MSAs
The software needed to construct an extremely large, accurate and taxonomically annotated MSA from the corresponding hiMSA is available at http://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/. The procedure, which is outlined in Figure 1, involves the following. (i) The CDD2MGS program converts an mFASTA formatted CDD hiMSA into MAPGAPS’s input format. (ii) Using the hiMSA as a query, MAPGAPS searches a large sequence database, such as the FASTA-formatted BLAST nr database (found at ftp://ftp.ncbi.nih.gov/blast/db/FASTA/). (The AddPhylum program can be used to label nr sequences by phylum and kingdom, which is useful for BPPS.) It is recommended to first split a large sequence database into smaller FASTA files using the program fasplit. MAPGAPS can then be run in parallel on each of the split files. Splitting the current version of nr into subsets of 250 000 sequences results in about 750 sub-files. After all subsets have been processed, the MAPGAPS output files are concatenated into a single file. (iii) The program PurgeMSA merges the output files while removing short sequence fragments and redundant sequences. We suggest removing those sequences matching <75% of the aligned columns and retaining all but one sequence among those sharing ≥98% sequence identity. In addition to the default MAPGAPS cma-formatted MSA, which requires relatively little memory and disk space, the ConvertMSA program can convert this into mFASTA-format, which, however, requires significantly more storage space and for large superfamilies may exceed available memory. ConvertMSA can also convert mFASTA-formatted MSAs into cma format and cma-formatted MSAs into rich text format, such as in Figure 3.
Steps required to create a large, high-quality MSA using the CDD hiMSAs and programs described here.
Dataset and template library
All the sequence alignment programs are benchmarked on the same set of 538 non-redundant proteins randomly collected from the PDB library 3 . These proteins have a pair-wise sequence identity less than 30% and length ranging from 34 to 804 residues. Proteins with broken chains or missing residues were not included. The sequences were divided into three categories: Easy, Medium and Hard targets, based on the consensus confidence score of the meta-threading LOMETS program 44 , which consists of 9 protein threading programs (dPPAS, MUSTER, HHsearch-I, HHsearch-II, PPAS, PROSPECT, SAM, SPARKS and SP3). A target is defined as Easy if at least one strong template hit can be detected for the target by each program with the Z-score higher than the confidence cutoff a target is defined as Hard if none of the threading programs has a strong template hit otherwise, it is considered a Medium target. In total, the 538 proteins are selected to include a balanced category distribution of difficulty with 137 Easy, 177 Medium and 224 Hard targets. Here, we have put more focus on the challenging targets by arbitrarily increasing the number of Medium and Hard proteins in our benchmark protein set, although a naturally collected sample from the PDB would have a much lower portion of Medium/Hard cases. A list of all the 538 proteins, together with the classification, are provided in http://zhanglab.ccmb.med.umich.edu/publicdata/benchmark1/protein_types.txt.
The existence of template structures in the library is a precondition for template identification. To eliminate potential bias of the alignment algorithms from the template structure library, we constructed the libraries of all threading programs using the same sequence identity cutoff updated to the same time stamp (by Jan, 2013). In fact, the template libraries for NW-align, SW-align, BLAST, PSI-BLAST, PSA, PPA, PPAS, dPPAS, MUSTER, SAM, PRC, PROSPECT, SPARKS, SP3 and FFAS are generated from the same set of non-redundant PDB proteins with a pair-wise sequence identify cutoff 70% (see http://zhanglab.ccmb.med.umich.edu/library/). The libraries for HHsearch-I and HHsearch-II are downloaded from ftp://toolkit.lmb.uni-muenchen.de, which has also a sequence identity cutoff of 70%. The size of these two libraries is about the same. The programs of all the tested methods are described in METHODS.
Summary of performance by individual alignment methods
Table 1 presents a summary of the 3D structural models, which are built by copying the framework of the highest ranked and the best in the top ten scoring templates based on the alignments generated by different alignment programs. The quality of alignments is generally measured by the root-mean-square deviation (RMSD) of the models (Columns 4–5), where BLAST, PSI-BLAST, PRC, FFAS and HH- search programs have the lowest RMSD to the targets (
7–9 Å). However, the alignments by these programs tend to have a smaller number of residues aligned (i.e. lower alignment coverage, Columns 6–7), typically below 80%. Such short alignments can have a negative impact on the full-length structure construction by homology modeling since structure information is missed for a large portion of unaligned sequences. In fact, the full-length models by MODELLER 45 have a very high RMSD (>20 Å) for all these local alignment methods (see values in parentheses). Here, the full-length models ware generated by the script model-default.py in the MODELLER package. The modeling results from MODELLER are deterministic in the sense that more runs do not change the quality result of the final models.
In Columns 2–3, we also list the result of the alignment models on TM-score, which is defined to combine the alignment accuracy and coverage as a unique score 46 :
where L is the length of target sequence, Lali the number of aligned residues and di the pair-wise distance of ith residue in the model and target after the optimal superposition. In this scoring function, the programs, which have a better balance of alignment coverage and RMSD, excel, including MUSTER, HHsearch and PPAS programs. The simple sequence-to-sequence based alignment algorithms generally have a lower TM-score.
Meanwhile, the local-to-local alignment based algorithms generally have a lower coverage and TM-score compared to the global-to-global alignment methods. A typical example is SW-align based on Smith-Waterman, which only identifies the highly conserved regions and has on average 56% residues aligned, while Needleman-Wunsch based NW-align uses the same parameter and scoring function but generates alignments with a much higher coverage (84.9%). Accordingly, the TM-score of NW-align is 21.1% higher than that of SW-align. The completeness of alignment searching also plays a role in final model determination. For instance, both BLAST and SW-align are local sequence alignments based on BLOSUM62 mutation scores. But BLAST searches are based on an incomplete heuristic word search algorithm, which has an average TM-score 7% lower than SW-align. BLAST is however 39 times faster than SW-align in our test.
Although TM-score aims to balance the accuracy and coverage of alignments, it still favors algorithms that have a higher coverage, since including additional residues in the alignments always has a positive contribution to TM-score according to Eq. 1, although the contribution is small if the added residues from templates are far away from the target. To examine the effect of such bias, we constructed full-length models of the targets based on the alignments, using the widely-used comparative modeling tool MODELLER 45 . Although TM-score is now normalized by the same length of the target sequence, the TM-score ranking of full-length models is largely consistent with that of the original alignments, except for some small but notable variations. Taking the top hits as an example, the original alignments by HHsearch-I have a lower TM-score than those by dPPAS (0.422 vs. 0.426) due to the low coverage of the sequences (76.3% vs. 81.9%). After full-length modeling, the TM-score of HHsearch-I becomes higher than that of dPPAS (0.444 vs. 0.438) and several other related algorithms (e.g. PPAS and SP3). Here, the more precise alignments by HHsearch-I in the aligned regions have probably introduced some restraint/guidance to the structure modeling of the unaligned regions, e.g. through bond-length and change connectivity, which have resulted in models of a higher overall TM-score.
Performance of sequence alignment programs in different target categories
The performance of different alignment programs varies with the difficulty of the targets, i.e. the evolutionary distance between target and template proteins. If we use the target structure as a probe to search through the PDB library by TM-align 47 , the average TM-scores of aligned regions of the best structural templates in the three categories of Easy, Medium and Hard are 0.779, 0.666 and 0.586, respectively, after excluding homologous templates with a sequence identity > 30%. This data on one hand sets up an upper-bar for template identifications by fold-recognition on the other hand, it demonstrates that the target category as defined by the LOMETS prediction is largely consistent with the actual difficulty of the template identification for the targets.
In supplementary Tables S1, S2 and S3, we summarize the results of different programs on the Easy, Medium and Hard targets, respectively. Figure 1 is the histogram of the average TM-score achieved by different programs. As shown in Table S1, HHsearch programs generate the highest TM-score in the Easy targets. MUSTER and other structure-assisted alignment methods (dPPAS, SP3 etc) generally outperform the HHsearch programs in the Medium and Hard targets. This data demonstrates the usefulness of structure-based features in detecting the distant homologous templates.
TM-score histogram of the top hits identified by different algorithms in Easy, Medium and Hard categories.
Specificity of alignment programs
Except for the accuracy of the template alignments (or sensitivity), the specificity of the alignments (i.e. the correlation of the scoring function and the accuracy of the final alignments) is another important measurement of the alignment algorithms, as this correlation essentially decides how the results can be used in the comparative structural modeling and function annotations.
In Figure 2, we present the TM-score data of the highest ranked alignment models versus the alignment scores by the 20 alignment programs. Here, we tried both the default alignment score of the programs and the Z-score (defined as the difference between the raw alignment score and average in units of standard deviation) and chose the one with the highest correlation to the TM-score of the final models to present in the plot. As expected, positive correlations are observed for all the alignment programs, with PPAS, SAM and MUSTER having the highest Pearson correlation coefficients (0.789, 0.787 and 0.782, respectively). The NW-align and BLAST programs have the lowest correlation coefficient because a number of targets have a high alignment score but with low quality (TM-score < 0.5), indicating a low specificity of the programs.
TM-score of full-length models of 20 threading methods on 538 non-homologous proteins versus the alignment scores.
Easy, Medium and Hard targets are colored blue, green and red, respectively. PSI-BLAST, BLAST and PRC use bit score and others use z-score to score the alignments.
We also mark in Figure 2 an alignment score cut-off that minimizes the false positive rate, FPR = FP/(FP + TN) and the false negative rate, FNR = FN/(TP + FN), where a model of TM-score > 0.5 is defined as a positive hit that has the correct fold 48 . The score cut-offs, false positive and false negative rates are listed in Table 2. The programs with alignment score that are calibrated by the statistics of random samples, including PSI-BLAST, SAM and FFAS, have the lowest FPR + FNR values, i.e. the highest specificity, based on this measurement. Meanwhile, the Easy and Hard targets are clearly grouped in the right-up and left-bottom regions in Figure 2 for all programs, demonstrating the dependence of the performance of the alignment algorithms on the evolutionary distance of target and templates.
Profile-based alignments versus sequence-based alignments
Depending on whether the homologous sequences are included in the target-template alignments, the alignment methods can be grouped into the three general categories of sequence-to-sequence alignment (including NW-align, SW-align and BLAST), sequence-to-profile alignment (PSI-BLAST, SAM and PSA) and profile-to-profile alignment (PRC, HHsearch-I, HHsearch-II, PPA, PPAS, dPPAS, MUSTER, PROSPECT, SPARKS, SP3 and FFAS). Since the sequence profiles derived from multiple sequence alignment of protein families contain important information of conserved/diverged locations along the sequences, the profile-based alignments can generally generate more accurate target-template alignments than that made by single sequence-based alignments 16,49 .
Such insight is also observed in our data analysis. As shown in Table 1 (rows highlighted in bold), the average TM-score obtained by the sequence-to-profile based methods is 18.4% higher than the TM-score from the sequence-to-sequence based methods. Similarly, the TM-score from profile-to-profile alignment methods is 49.8% higher than that of sequence-to-sequence based methods. These increases in TM-score are not only due to the higher coverage of alignments (81.8% vs. 62.6%), but also the enhanced accuracy of alignments as the average RMSD is reduced slightly in the profile-profile methods from 10.4 to 10.2 Å. This tendency is also seen in Tables S1,S2,S3 where the targets were categorized into different groups of Easy, Medium and Hard, demonstrating that the profile-based alignments enhance both close and distant homology identifications.
Two types of sequence profiles, PSSM and HMM, are often exploited in various alignment methods. Given a MSA, the PSSM profile is designed to account for the estimated frequency of amino acids at each position, while the HMM profile accounts for both amino acid frequency and position-specific probabilities for insertion and deletion. Although the HMMs seem to contain additional gap information from MSAs, there is no obvious difference between the HMM-based (e.g. HHsearch-I and –II) and PSSM-based alignment algorithms (e.g. PPAS), in terms of the TM-score of the alignment models (Table 1). However, HMM-based methods did generate higher TM-scores than PSSM-based methods in the Easy targets (Table S1). Additionally, the HMM-based methods have generally a lower RMSD and lower coverage of alignments, indicating that the HMM method is more sensitive in detecting local structural motifs and scaffolds.
Meanwhile, there are a number of targets that have the correct templates identified by either HMM- or PSSM-based methods (but not both), demonstrating that these two types of methods are complementary to each other. This complementarity from multiple alignment algorithms is essential to the success of meta-server based structure modeling approaches 44,50 .
How much space is left for improvement by structural feature prediction?
The performance of profile alignments could be further improved by incorporating structural information. One example is secondary structure comparison, which has been used by almost all contemporary alignment/threading methods to guide the target-template alignments. As a quantitative test of the impact of secondary structure information on alignment accuracy, we developed two sequence profile-profile based methods, PPA and PPAS, where the only difference is that PPAS contains a secondary structure match in the scoring function but PPA doesn't (see Eqs. 3 and 4 in METHODS). As a result, the inclusion of secondary structure prediction increases the TM-score of PPA by 6.8%. MUSTER is another typical profile-profile alignment based algorithm that incorporates multiple composite structure features in the alignments, including secondary structure, residue depth, solvent accessibility and backbone torsion angle predictions. These features result in a TM-score increase of 9.6% compared to the PPA method, corresponding to a p-value < 10 −14 in paired student t-test.
The performance of the structure feature assisted algorithms relies on the accuracy of structure feature predictions for the target sequence. In our test on the 538 non-homologous proteins, the average Q3 score (three structure states per residue overall accuracy) for PSSpred and PSI-pred is 83.1% and 80.7%, respectively the mean absolute errors in ψ and φ angle predictions are 28° and 41°, respectively and the Pearson correlation coefficient correlation between predicted and actual solvent accessibility is 0.678. Incorrect assignments of the structure features can compromise the performance of MUSTER. In fact, we observed a number of cases where the TM-score of the alignments by MUSTER, which considers additional structural features, is lower than that of PPAS.
In order to explore the potential of the alignment improvement obtained by considering structural features, we implemented MUSTER using the native structure features derived from the target structures, where the weighting parameters were re-optimized in a separate test of 100 proteins. As shown in Table 1, the average TM-score of the full-length models from MUSTER alignments showed a gradual increase from 0.449 to 0.511, when we exploited more native structure features from secondary structures (MUSTER SS ), backbone torsion angle (MUSTER SS+BTA ) and solvent accessibility (MUSTER SS+BTA+SA ). This change corresponds to an overall increase of 13.8% in the average TM-score.
In Figure 3, we present an illustrative example from the PP7 bacteriophage coat protein (PDB ID: 2qudA), which has the secondary structures arranged as β1-β2-β3-β4-β5-β6-α1-α2 from the N- to C-terminals. The PSI-pred method has however mis-assigned most secondary structure elements in 12S-90M (see ‘*’ in Figure 3A), which resulted in the first two beta-strands (7T-11S and 15A-25T) in the target mis-aligned to the coiled regions in the template 1qbeB by MUSTER. This alignment has a TM-score = 0.607. When using the actual secondary structure assignment, the MUSTER SS program correctly matches the two beta strands of the target with the strands on the template. Based on the same template, the correction of the secondary structure comparison increases the TM-score of the model to 0.7 in this example.
The illustration of template identifications for 2qudA.
(A) MUSTER with predicted secondary structure (B) MUSTER SS with native secondary structure. The experimental structure and MUSTER models are shown in red and green cartoons, respectively and the first two beta-strands in (A) are in yellow on the template. The secondary structures are labeled as ‘C’ for coil, ‘E’ for strand and ‘H’ for helix, where ‘Pred’ and ‘Obs’ denotes the PSI-pred prediction and the native, respectively. ‘*’ in (A) marks the residues with mis-predicted secondary structure.
Despite the significant increase in alignment accuracy brought by the integration of structure features, the quality of the alignments using the best structure features from the native is still far from the best templates detected by structural alignments, i.e. the average TM-score by TM-align is 37.1% higher than that by MUSTER SS + BTA + SA (Table 1), which demonstrates considerable room for further alignment improvement. The gap is relatively small in Easy targets (7.4%) according to the data in Table S1, which indicates that the current state-of-the-art alignment methods generate nearly optimal alignments for close homology targets. But for the Medium and Hard targets, the gaps become highly significant, which correspond to a TM-score difference of 35.6% and 79.2%, respectively (Tables S2 and S3). Apparently, such gaps cannot be filled by solely improving the structure feature prediction methods and a completely different alignment system based on novel scoring and alignment schemes might be required.
CDvist is designed to provide maximum domain coverage in protein sequences by bundling the best current domain search tools into a pipeline that exhaustively searches through a series of domain databases in an iterative fashion. This methodology yields the most comprehensive domain architecture for a given protein sequence. Rich visualization, download options and linear speed-up for bulk queries should be appealing to both biologists and bioinformaticians. This webserver would be especially useful for multi-domain proteins with rare or unique domain architectures and those prone to domain swap, where whole sequence similarity searches often yield uninformative and misleading results ( Iyer et al., 2001).
Twenty threading/alignment methods, covering different categories of target-template alignment algorithms and possible to install at local computers, are benchmarked in this article. All algorithms without cited references are newly developed in house and first presented in this study.
NW-align is a sequence-to-sequence alignment program constructed based on the standard Needleman-Wunsch dynamic programming algorithm 10 . The amino acid mutation matrix is from BLOSUM62 52 with gap opening penalty = −11 and gap extension penalty = −1.
SW-align is a sequence-to-sequence alignment program using a similar setting as NW-align but with dynamic programming based on the standard Smith-Waterman algorithm 53 . The major difference from NW-align is that the negative score values are set to zero and the alignment trace-back procedure starts from the highest scoring cell and ends with a cell of zero score in SW-align. This setting allows SW-align to identify subsequence motifs having the highest local sequence similarity. The source codes and the executables of both NW-align and SW-align are available at http://zhanglab.ccmb.med.umich.edu/NW-align.
BLAST 13 is a local sequence alignment tool based on a heuristic searching method, where high-scoring segment pairs (HSPs, or words) are first identified by gapless comparisons. The final alignments are constructed by extension and connection of the HSP regions. The heuristic algorithm in BLAST is often suboptimal but much faster than the optimal dynamic programming algorithms.
PSI-BLAST 14 is a sequence-to-profile alignment program extended from BLAST which aims to increase the alignment sensitivity of distant homologous proteins by iterative MSA search. It first collects a list of close homologous sequences from a reference database (e.g. NCBI non-redundant sequence database, NR) by BLAST. A PSSM is then derived from the MSA of the sequence homologies, which is used to search against the reference database again to identify a newer set of homologous sequences. The procedure can be repeated a number of times until the PSSM profiles converge. In our test, PSI-BLAST was searched against NR database for 3 iterations using an E-value cutoff, which assesses the significance of the HSP score, below 0.001.
PSA is sequence-to-profile alignment algorithm based on the Needleman-Wunsch dynamic programming. The scoring function of the ith position in the query (q) aligned with the jth position in the template (t) is
where Fq(i, k) represents the frequency profile of kth amino acid at ith position of the query. Bt(k, j) denotes a BLOSUM mutation score between the amino acid k and jth residue of the template. The shift parameter is introduced to avoid the alignment of unrelated residues in the local regions. Parameters of shift (−0.01), gap opening (go, −8.6) and gap extension (ge, −0.9) penalties were optimized on the ProSup dataset 54 .
PPA is an in-house profile-profile alignment method on the Needleman-Wunsch algorithm. The scoring function is defined by
where Fq(i, k) and Lt(j, k) stand for the sequence frequency profile of query and the log-odds profile of template, respectively. To build the sequence profiles, the sequences are searched against the NR by 3 PSI-BLAST iterations, at an E-value cutoff 0.001. The Henikoff weighting scheme 55 is then used to generate frequency or log-odds profiles. Similarly, the parameters of shift (−0.94), go (−6.8) and ge (−0.52), are optimized by trial and error using the ProSup dataset.
PPAS is an in-house profile-profile alignment method that combines profile log-odds score and secondary structure comparison. The scoring function is defined by
where ScorePPA(i, j) is defined in Eq. 3, δ(Sq(i), St(j)) is the Kronecker delta function to assess the secondary structure match between target and template. Sq(i) is the secondary structure of the ith residue on the target predicted by PSSpred (http://zhanglab.ccmb.med.umich.edu/PSSpred) and St(j) is the secondary structure of the jth residue on the template structure assigned by DSSP. A position-specific gap penalty scheme is used in the alignment search, i.e. no gap is allowed inside the secondary structure regions, go and ge penalties apply to other regions and the ending gap-penalty is neglected. The four parameters C1 (0.65), shift (−0.96), go (−7.0) and ge (−0.54), are optimized for PPAS in a similar way as PPA.
dPPAS is an in-house profile-profile alignment program extended from PPAS. The only difference from PPAS is that a structure fragment depth profile is added in dPPAS to enhance the alignments, i.e.
where Fq(i, k) and Lt(j, k) are defined in Eq. 3. Lstr(j, k) is a frequency depth profile derived from a set of structural fragments that have similar depth as the fragment at jth position of the template 17,19 . Similarly, the parameters (C1 = 6.5, shift = −0.96, go = −7.0 and ge = −0.54) are optimized on the ProSup dataset.
MUSTER 17 is a profile-profile based threading program which combines multiple sequence and structure matching information. In addition to the sequence profiles obtained by PSI-BLAST searches, the scoring function contains secondary structure match (SS), fragment depth profiles, solvent accessibility (SA), backbone torsion angles (BTA) and hydrophobic scoring matrix. The optimal alignment is generated by Needleman-Wunsch dynamic programming. Compared to dPPAS, MUSTER contains additional terms from SA, BTA and hydrophobic scoring matrix matches, whereby the weighting parameters are re-trained by a new dataset.
To further examine the potential of structure-assisted threading algorithms, we developed three variants of MUSTER programs, MUSTER SS , MUSTER SS + BTA and MUSTER SS + BTA + SA , which exploit the SS, BTA and SA features extracted from the experimental structures of the target. Similar to MUSTER, all parameters in these algorithms are optimized in a separate training set of 100 non-redundant proteins by maximizing the TM-score.
SAM 56 is a hidden Markov model (HMM) based protein fold-recognition method. Starting from the PSI-BLAST search, SAM constructs a HMM profile based on the iterative MSA searches. The HMM profile is then used to search through the PDB library to identify structural templates. SAM can conduct both local and global alignment searches and we use the local alignment mode in this work.
PRC 57 is a program for scoring and aligning profile HMMs. To run PRC, we first construct HMMs of both target and template sequences by SAM 56 . The HMM–HMM based alignments are then computed and ranked by PRC which is designed to find the Viterbi path that maximizes the sum of forward–backward odds scores.
12. HHsearch-I and HHsearch-II
HHsearch 24 is a HMM-HMM based alignment program which combines the profile log-odds score and the secondary structure prediction in the Viterbi dynamic programming. We run two versions of HHsearch: HHsearch-I uses PSI-BLAST to start the MSA search for building the profile HMMs for target and template sequences, while HHsearch-II uses HHblits to construct the profile HMM for target sequences. HHblits uses a discretized-profile prefilter that can generate HMM profiles faster than PSI-BLAST 58 . The final query-template alignments are constructed by the same HHsearch program. Both HHsearch-I and HHsearch-II are in the local alignment mode.
PROSPECT 21 is a sequence profile-profile alignment algorithm assisted with a residue-level contact potential and SS predictions. A global optimization of target-template alignment is generated by the divide-and-conquer searching method.
14. SPARKS and SP3
Both SPARKS 18 and SP3 19 were developed in Zhou Lab. In SPARKS, the authors exploit a sequence profile–profile alignment combined with a single-body statistical potential in SP3, they use a residue depth dependent structure profile to replace the single-body potential used in SPARKS.
FFAS 59 is a sequence profile-profile based alignment program. It calculates the sequence profile by PSI-BLAST searching against the NR85s database with 5 iterations. A dot-product scoring function is then used to align two sequence profiles. The alignment score is finally translated into a statistical measure by comparing it with the distribution of scores obtained for pairs of unrelated proteins.
Comparing the SARS-CoV and SARS-CoV-2 spike protein structures locally
At the end of the last section, we compared the validated SARS-CoV and SARS-CoV-2 spike proteins. However, having just a single RMSD value does not greatly help us understand what the individual changes have been that have differentiated the SARS-CoV-2 spike protein. That is, we need to move toward comparing two structures locally.
In keeping with the biological maxim that the structure of a protein informs the function of that protein, can we find any clues lurking in the spike proteins’ structure that would indicate why the two viruses behave differently in humans? Why did SARS-CoV fizzle out while SARS-CoV-2 was infectious enough to cause a pandemic?
In class, we introduced a similarity measure called Q per residue (Qres), which measures the similarity of two protein structures at the i-th alpha carbon. Qres is a similarity metric ranging between 0 and 1, with low scores representing low similarity between two proteins at the i-th position, and higher scores representing high similarity at this position.
We now will compute Qres for the SARS-CoV and SARS-CoV-2 spike proteins, limiting our attention to a highly variable subregion of these proteins called the receptor binding domain (RBD). We will use the VMD plugin Multiseq, a bioinformatics analysis environment, to align the SARS-CoV-2 RBD and SARS RBD having the PDB entries 6vw1 and 2ajf, respectively. After calculating Qres at every position, we will identify where the two RBD regions differ and visualize these regions within the larger protein structures.
It is important to note that the The experimentally verified SARS-CoV-2 structure is a chimeric protein formed of the SARS-CoV RBD in which the RBM has the sequence from SARS-CoV-2. A chimeric RBD was used for complex technical reasons to ensure that the crystallization process during X-ray crystallography could be borrowed from that used for SARS-CoV.
Multiseq aligns two protein structures using a tool called Structural Alignment of Multiple Proteins (STAMP). Much like the Kabsch algorithm discussed in class, STAMP minimizes the distance between alpha carbons of the aligned residues for each protein or molecule by applying rotations and translations. If the structures do not have common structures, then STAMP will fail. If you are interested in more details on the algorithm used by STAMP, click here.
First, download VMD. In what follows, the program may prompt you to download additional protein database information, which you should accept.
We will need to download the .pdb files for 6vw1 and 2ajf. Visit the 6vw1 and 2ajf PDB pages. For each protein, click on Download Files and select PDB Format . The following figure shows this for 6vw1.
Next, launch VMD, which will open three windows. We will not use VMD.exe , the console window, in this tutorial. We will load molecules and change visualizations in VMD Main , and we will use OpenGL Display to display our visualizations.
We will first load the SARS-CoV-2 RBD (6vw1) into VMD. In VMD Main , go to File > New Molecule . Click on Browse , select your downloaded file ( 6vw1.pdb ) and click Load .
The molecule should now be listed in VMD Main , with its visualization in OpenGL Display .
In the OpenGL Display window, you can click and drag the molecule to change its orientation. Pressing ‘r’ on your keyboard allows you to rotate the molecule, pressing ‘t’ allows you to translate the molecule, and pressing ‘s’ allows you to enlarge or shrink the molecule (or you can use your mouse’s scroll wheel). Note that left click and right click have different actions.
We now will need to load the SARS-CoV RBD (2ajf). Repeat the above steps for 2ajf.pdb .
After both molecules are loaded into VMD, start Multiseq by clicking on Extensions > Analysis > Multiseq .
You will see all the chains listed out per file. Both .pdb files contain two biological assemblies of the structure. The first is made up of Chain A (ACE2) and Chain E (RBD), and the second is made up of Chain B (ACE2) and Chain F (RBD). Because Chain A is identical to Chain B, and Chain E is identical to Chain F, we only need to work with one assembly. (We will use the second assembly.)
Because we only want to compare the RBD, we will only keep chain F of each structure. To remove the other chains, select the chain and click Edit > Cut . You can remove the nucleic structures as well.
The protein structures are visible, but we need to overlap them in order to compare them visually. Click Tools > Stamp Structural Alignment , and a new window will open up.
Keep all the default settings and click OK once you have done so, the RBD regions will have been aligned into a single structure.
Now that we have aligned the two RBD regions, we would like to compare their Qres values over the entire RBD. To see a coloring of the protein alignment based on Qres, click View > Coloring > Qres .
Scroll across the sequence alignment (in the untitled.multiseq window) to see how Qres changes over positions. Blue indicates a high value of Qres, meaning that the protein structures are similar at this position red indicates low Qres and dissimilar protein structures.
If we zoom in on the region around position 150 of the alignment, then we can see a 13-column region of the alignment within the RBD region for which Qres values are significantly lower than they are elsewhere. This region, which is shown in the figure below, corresponds to positions 476 to 486 in the SARS-CoV-2 spike protein and positions 463 to 472 in the SARS-CoV genome.
The OpenGL Display window will now color the superimposed structures according to the value of Qres. We are interested in regions of the alignment having low Qres, which correspond to locations in which the coronavirus RBDs differ structurally. Note that these regions occur on the boundaries of the RBD’s structure, where bonding is more likely to occur.
Exercise: There is another region in the interior of the RBD structural alignment that shows significant differences between the two protein structures. Where is it?
We won’t ask you to do this, but we can color the alignment of the proteins complexed with ACE2 according to Qres, which produces the figure below. (The ACE2 enzyme is shown in green.) The low-Qres region of the RBM alignment that we highlighted in the above figure is outlined in the figure below.
The rich set of structural and functional characteristics derived for each protein as well as the high degree of automation and advanced analytical features make the PEDANT database a useful tool for structural genomics. In particular, PEDANT can be used to facilitate the target selection process. Using the sequence clustering results described above it is easy to judge the domain structure of the protein families. Further, circular diagrams visualize available structural information on each cluster member (domains with known three-dimensional structure, transmembrane regions). Based on these pre-computed results we have created an efficient target selection tool called STRUDEL [STRucture DEtermination Logic ( 22 )]. A Web-based interface for this tool allowing PEDANT users to select structural targets of interest according to specified criteria is currently being developed.
One main approach to taxonomic and functional binning of microbiome shotgun sequences is based on protein homology (Glass, Wilkening, Wilke, Antonopoulos, & Meyer, 2010 Huson, Auch, Qi, & Schuster, 2007 ). In this approach, the sequences are first aligned against a reference database of protein sequences of known taxonomic and functional identity, and then the resulting alignments are used to assign the sequences to taxonomic and functional bins.
Why align against protein sequences? While analysis of microbiome sequences using DNA alignment against genomic references is feasible, there are a number of issues with this approach. First, currently, genomic reference databases cover only a small fraction of the diversity present in the environment (Wu et al., 2009 ). Second, the high level of redundancy of genomic sequences causes performance issues when query sequences display very large numbers of equally good alignments. Translated alignment ameliorates these issues to a degree because protein sequences are much more conserved than genomic sequences. Finally, a proper biological understanding of processes within a given microbiome requires detailed knowledge of the proteins present and their alignments to reference sequences of known function (Willmann et al., 2015 ).
The core computation of the approach presented here is the translated alignment of microbiome sequences against the NCBI-nr database (Benson, Karsch-Mizrachi, Lipman, Ostell, & Wheeler, 2005 ). In early microbiome studies (Huson et al., 2007 Poinar et al., 2006 Venter et al., 2004 ), BLASTX (Altschul et al., 1997 ) was used to align small numbers of reads (on the order of hundreds of thousands) against a small database (in 2007, NCBI-nr contained approximately 2 million sequences). For subsequent studies involving hundreds of millions of reads (Mackelprang et al., 2011 Qin et al., 2010 ), BLASTX was run at super-computer centers. Our lab developed DIAMOND (Buchfink, Xie, & Huson, 2015b ) to replace BLASTX in such analyses, providing a 20,000-fold speedup over BLASTX on short sequencing reads, while maintaining sufficient sensitivity. DIAMOND is used as the main alignment engine in a number of analysis pipelines (Franzosa et al., 2018 Huson et al., 2016 Zhu et al., 2017 ).
Analysis of short read microbiome samples usually involves determining the highest scoring alignments of each read to a set of reference sequences, followed by assignment to taxonomic and functional bins, using heuristics such as the naïve LCA (lowest common ancestor) approach for taxonomic binning (Huson et al., 2007 ) and the best hit approach for functional assignment (Huson, Mitra, Weber, Ruscheweyh, & Schuster, 2011 ). Long reads or assembled contigs require modified algorithms during alignment and binning, and both DIAMOND and MEGAN provide long read modes to operate on long (erroneous) sequences (Arumugam et al., 2019 Huson et al., 2018 ). Microbiome studies can be performed using either short read technology (Bentley, 2006 ), see e.g., (Willmann et al., 2015 ) or long read sequencing technology (Jain, Olsen, Paten, & Akeson, 2016 Rhoads & Au, 2015 ), see e.g., (Arumugam et al., 2019 ).
While short read sequences have lengths measured in hundreds of bases, long read technologies produce reads that are tens of kilobases (kb) in length, on average. Hence, one might assume that assembly is more useful for short reads than for long reads. Paradoxically, for microbiome sequences, this is not the case. For sequencing reads obtained from a mixed community of organisms, assembly of short reads usually leads to disappointingly low average scaffold lengths (a couple of kb Boisvert, Raymond, Godzaridis, Laviolette, & Corbeil, 2012 ), whereas the assembly of long reads can result in complete closed circular chromosomes (Arumugam et al., 2019 ). Thus, in this protocol, we follow two different strategies for taxonomic and functional classification of reads, one for short reads and one for long reads. For short reads, we follow a naïve read binning approach, whereas for long reads we first carry out an assembly and correction procedure and then perform the taxonomic and functional classification on the assembled contigs.
- 1. Alignment of all reads against a protein reference database using DIAMOND
- 2. Taxonomic and functional analysis of the resulting alignments using a program called MEGANIZER (part of the MEGAN package), or MEGAN
- 3. Interactive exploration and analysis using MEGAN
Both DIAMOND and MEGANIZER can be run either in short read mode (by default) or long read mode, so as to accommodate the two different types of input. In addition, in a preprocessing step, both types of sequences can be subjected to quality control and, especially prudent in the case of long reads, sequence assembly.
We provide two protocols. In the first protocol, we will discuss how to apply the core DIAMOND+MEGAN microbiome analysis pipeline in the context of a typical short read project. We will illustrate this protocol using a small, published dataset (Hu, Pang, Huang, Zhang, & Zhang, 2018 ) of 11 gut microbiome samples. In Support Protocol 1, we will discuss optional preprocessing. A general workflow of this process can be seen in Figure 1.
In the second protocol, we will discuss application of the core pipeline to long read data. Here, assembly of the reads in an initial preprocessing step is highly recommended, and is illustrated using the Unicycler pipeline. We will illustrate the necessary steps using a published dataset (Arumugam et al., 2019 ) collected from an enrichment bioreactor seeded with waste-water treatment sludge.
Protein sequence motifs, active or functional sites, and functional annotations
Perform secondary structure predictions on protein sequences.
Find binding specificity information about DNA-protein complexes.
Find information about the binding specificity of DNA-binding proteins.
Predict interacting partners and binding models.
Use this web-based sequence motif visualization system to display sequence motif information in its appropriate three-dimensional (3D) context.
Protein function prediction and annotation in an integrated environment powered by web service.
Find information about protein binding.
Use to predict function from de novo protein sequences.
Search for short active protein sequences with demonstrated biological activities.
Search for ungapped segments corresponding to the most highly conserved regions of proteins.
Identify and measure surface accessible pockets as well as interior inaccessible cavities, for proteins and other molecules.
To search for catalytic residue annotation for enzymes in the Protein Data Bank.
Predict protein function using Gene Ontology.
Automatically calculate evolutionary conservation scores of key amino acid residues and map them on protein structures.
Mine the protein structure space.
Predict short linear motifs (3-8 residues) in a set of protein sequences.
A web client for visualizing protein sequence feature information using DAS.
Identify the domain architecture within a protein sequence.
Predict enzyme catalytic site.
Predict functional sites in eukaryotic proteins.
Use a collection of tools for protein analyses.
Predict potential protein post-translational modifications and find potential single amino acid substitutions in peptides.
Search for information related to the catalytic mechanisms of enzymes.
An integrated feature-based function prediction server for vertebrate proteomes.
Identify the closest matching PRINTS sequence motif fingerprints in a protein sequence.
Search for functional annotation of important sites in proteins with known structures.
Produce 3D conformations of small drug compounds.
A database presenting experiment-based results in human proteomics.
Conduct exhaustive intermediate profile searches of a set of homologous protein sequences.
Design protein mutations in site-directed mutagenesis.
Use for protein functional site identification.
Annotate protein using this integrated annotation resource.
Identify protein family (and DNA) domains, patterns, motifs, protein families, and functional sites.
Interactive forecasting of protein interaction hot spots.
Discover long patterns in protein sequences.
Database containing pairs of structural analogs and their alignments.
Find sequence patterns in DNA and protein sequences.
A web server for knowledge-based modeling of protein-peptide complexes, specifically peptides in complex with major histocompatibility complex (MHC) proteins and kinases.
Find structural segments or motifs for protein structures.
Find motifs in a protein sequence.
Visualize protein sequence motifs on the 3D protein structures.
Find presence of any known protein motif (Prosite and Pfam) in a protein sequence.
Recognize spatial chemical binding patterns common to a set of protein structures.
Analyze proteins for the presence of N-terminal N-myristoylation site.
Find the presence of N-Glycosylation sites in human proteins.
Find the presence of O-GalNAc (mucin type) glycosylation sites in mammalian proteins.
Analyze eukaryotic proteins for the presence of serine, threonine and tyrosine phosphorylation sites.
Find possible kinase specific phosphorylation sites in eukaryotic proteins.
Predict cleavage sites at basic amino acid locations in neuropeptide precursor sequences.
Find information about patented nucleotide and protein sequences.
Search for information about glycoproteins with O-linked and C-linked glycosylation sites.
Find information about protein sequence annotations.
A server to predict protein active site residues.
Search for structural and functional information on the protein functional sites.
Search 3D protein fragments similar in structure to known active, binding and posttranslational modification sites.
Conduct genome wide functional and structural analysis.
Search for phosphorylation data of any protein of interest.
Search for information on prokaryotic proteins that undergo serine, threonine, or tyrosine phosphorylation.
Determine correct names for proteins.
Web application for predicting protein disorder by using physicochemical features and reduced amino acid set of a position-specific scoring matrix.
Find homologous protein-protein interactions across multiple species.
Search your query sequence against PROSITE pattern database for protein motifs.
Find information about protein-RNA complexes from the Protein Data Bank (PDB).
Search for protein fingerprints.
Identify protein families and domains for a given protein sequence.
A comprehensive database of pattern-recognition receptors and their ligands.
Search for short nucleotide or peptide sequences such as cis-elements in nucleotide sequences or small domains and motifs in protein sequences.
Database specialized in documenting human PPBD-containing proteins and PPBD-mediated interactions.
Predicts potential protease cleavage sites and sites cleaved by chemicals in a given protein sequence.
Predict combined transmembrane topology and signal peptides.
Search for eukaryotic phosphorylation sites.
Search for 3D structure and functional annotation of phosphorylation sites in proteins.
Search the database of in vivo phosphorylation sites of human and mouse proteins
Find information about polyglutamine (polyQ) repeats.
Find the presence of protein motifs and patterns in an amino acid sequence.
Predict signal peptide sequences and their cleavage positions in bacterial and eukaryotic amino acid sequences.
Predict protein functions based on known structures.
Predict the location of potential protein-protein binding sites for unbound proteins.
Identify short linear signatures in protein termini.
Analyze and identify newly obtained protein sequences.
Predict protein binding sites in a protein sequence based on geometrical analysis of protein tertiary substructures.
Search for evolutionarily conserved motif-like patterns in protein sequences.
Web-based server for analyzing and predicting RNA binding sites in proteins.
Search for similarities between proteins by simultaneous matching of multiple motifs.
Predict residues in protein sequences that determine the proteins' functional specificity.
Predict specificity-determining residues in protein families.
Find shared motifs in proteins with a common attribute.
Conduct in silico sumoylation sites prediction.
Detect protein sequence section under positive evolution selection.
Search for motifs and patterns within protein sequences.
Detect patterns, profiles and motifs in a protein sequence.
Search for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains.
Find information about populations carrying polymorphisms within protein binding pockets that make them susceptible to serious adverse drug reaction (SADR).
Search the presence of a motif in either amino acid sequence or nucleotide sequence.
Predict signal peptides and their cleavage sites.
Predict the presence of tyrosine sulfation sites in protein sequences
Look at protein structure from a ligand and binding site perspective.
Use a collection of bioinformatics tools at this portal site.
Find information about tandem repeats in proteins that carry fundamental biological functions and are related to a number of human diseases.
Find information about functional residues in alpha-helical and beta-barrel membrane proteins.
Database of domains and motifs with conservative location in transmembrane proteins.
Search for highly conserved and specific protein sequence motifs.
Predict functional sites in protein sequence alignments use different methodologies.
Find de novo protein motifs from chromatin immunoprecipitation data.
Scan query structures for functional sites in both proteins and nucleic acids.
Analyze quantitative structure-activity relationship of related protein families.
Search for ungapped alignments of highly conserved regions among a protein family or superfamily.
Predict the functional sites of proteins.
An expert system for predicting ligand-binding residues in protein structures.
Automatically identify spatially interacting motifs among distantly related proteins sharing similar folds and possessing common ancestral lineage.
The Health Sciences Library System supports the Health Sciences at the University of Pittsburgh.
© 1996 - 2014 Health Sciences Library System, University of Pittsburgh. All rights reserved.
Contact the Webmaster