Solvent Accessibility, the 20% cut-off method


I'm reading the papers linked below, and the first three of them use a 20% cut-off to classify residues as buried or exposed, based on a relative solvent accessibility (RSA) value.

I understand how the RSA is calculated: the calculated solvent accessibility of a residue is divided by its total solvent accessibility, taken from Table 2 of paper 4.

RSA = calculated/total

For example, if an arginine has a calculated solvent accessibility of 55.43 and its total solvent accessibility is 241, then RSA = 55.43/241 = 23%, so this arginine is considered exposed (see statement 1 below).

What confuses me is the definition, or lack of one, of the 20% method for classifying a residue as exposed or buried.

I am assuming it means one of the following:

  1. If an amino acid's RSA is below 20% it is buried, and above 20% it is exposed. So an amino acid with an RSA of 21% is considered exposed; this value seems a little low to me. I think statement 2 would make more sense.

  2. If an amino acid's RSA is below 20% it is buried, and above 80% it is exposed.

Which statement, if any, is correct?

Paper 1 - see methods section first paragraph

Paper 2 - see figure 5 and table 3

Paper 3 - see abstract and dataset

Paper 4 - see table 2 for total values


It's 1. Below the cutoff is buried; above the cutoff is accessible.

Paper 1: "A cutoff of 20% was used to define the two states, buried or exposed. With this definition, the dataset was, roughly, evenly split between the two states."

Only two states are possible: solvent accessible and buried.

Paper 2: "A given residue is defined as exposed (e) if its RSA is larger than the cutoff value, and otherwise it is defined as buried (b)."

Abstract for Paper 3: a cutoff of 20% for two-state definition of solvent accessibility.

If it were definition #2, it would be a three-state definition.
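As a minimal sketch of the two-state rule quoted above, the following Python computes the RSA for the arginine example from the question and labels it with a 20% cutoff. The function names are hypothetical, and the 241 value is the total accessibility quoted in the question, not a value checked against paper 4.

```python
def relative_solvent_accessibility(calculated_asa, total_asa):
    """Return RSA as a fraction: calculated accessibility / total accessibility."""
    if total_asa <= 0:
        raise ValueError("total_asa must be positive")
    return calculated_asa / total_asa


def two_state(rsa, cutoff=0.20):
    """Exposed if RSA is larger than the cutoff, otherwise buried (Paper 2 wording)."""
    return "exposed" if rsa > cutoff else "buried"


# Arginine example from the question: 55.43 / 241
rsa_arg = relative_solvent_accessibility(55.43, 241.0)
print(f"RSA = {rsa_arg:.1%}, state = {two_state(rsa_arg)}")
```

The cutoff is a parameter because, as the quoted papers show, different authors pick different values (5%, 15%, 20%).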

Paper 4: "On average, 15% of residues in small proteins and 32% in larger ones may be classed as “buried residues”, having less than 5% of their surface accessible to the solvent… "

This paper, by the dean of structural analysis Cyrus Chothia, uses a 5% cutoff, not 20%…

The abstract goes on to say… "The accessibilities of most other residues are evenly distributed in the range 5 to 50%."

This passage hints that SA doesn't even go up to 80%. You often won't get more than, say, 60% with this calculation. I'm just guessing, but the thought is that unless a residue is at a terminus of the protein (which is often disordered and doesn't show up in a crystal structure), it has two adjoining amino acids, and the solvent-accessible area taken up by contact with those neighbors alone could easily be 20% of the total.


Are you sure that the RSA formula is right? I have found a different description: "Relative solvent accessibility classes are usually derived from the DSSP program by normalizing it at the maximum value of exposed surface area obtainable for each residue. Different arbitrary threshold values of solvent accessibility are chosen to define binary categories (buried and exposed) or ternary categories (buried, partially exposed, or exposed)."

Pollastri, G., Baldi, P., Fariselli, P., & Casadio, R. (2002). Prediction of coordination number and relative solvent accessibility in proteins. Proteins: Structure, Function, and Bioinformatics, 47(2), 142-153.


Prediction of protein solvent accessibility using support vector machines

A Support Vector Machine learning system has been trained to predict protein solvent accessibility from the primary structure. Different kernel functions and sliding window sizes have been explored to find how they affect the prediction performance. Using a cut-off threshold of 15% that splits the dataset evenly (an equal number of exposed and buried residues), this method was able to achieve a prediction accuracy of 70.1% for single sequence input and 73.9% for multiple alignment sequence input, respectively. The prediction of three and more states of solvent accessibility was also studied and compared with other methods. The prediction accuracies are better than, or comparable to, those obtained by other methods such as neural networks, Bayesian classification, multiple linear regression, and information theory. In addition, our results further suggest that this system may be combined with other prediction methods to achieve more reliable results, and that the Support Vector Machine method is a very useful tool for biological sequence analysis.
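The abstract mentions sliding windows over the primary structure and a 15% cutoff for the two-state labels. A minimal sketch of how such inputs could be prepared is shown below. The one-hot encoding, the zero padding at the chain ends and the default window size are assumptions for illustration, not details taken from the paper, and the SVM training step itself is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_features(sequence, center, window=7):
    """One-hot encode a window centred on `center`.

    Positions that fall off either end of the sequence (and unknown
    residue letters) are encoded as all zeros. Assumed encoding.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    features = []
    for pos in range(center - half, center + half + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            idx = AMINO_ACIDS.find(sequence[pos])
            if idx >= 0:
                one_hot[idx] = 1
        features.extend(one_hot)
    return features


def binary_labels(rsa_values, cutoff=0.15):
    """1 = exposed (RSA above the cutoff), 0 = buried."""
    return [1 if rsa > cutoff else 0 for rsa in rsa_values]


vec = window_features("MKV", center=0, window=3)
labels = binary_labels([0.10, 0.15, 0.30])
```

Each residue then yields one feature vector of length 20 × window and one label, which is the shape of input a standard SVM implementation expects.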


Introduction

A palindrome is a set of characters in a sequence that reads the same in both directions. Palindromes are present in nucleic acid and protein sequences. Nearly 30% of the residues in a protein are members of peptide palindromes that are tripeptidic or longer [1]. Palindromes exceeding 10 residues in length are not rare [2]. As the length of the palindrome decreases, more palindromes are known to occur in proteins [3]. 26% of the protein sequences in the SwissProt database comprise at least one palindromic repeat [4]. Palindrome sequences have a high tendency to form α-helices [5]. In general, the role of palindromes in proteins is not clear.
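To make the definition concrete, here is a small sketch that lists every palindromic stretch of at least three residues in a one-letter protein sequence. It is a brute-force illustration with a hypothetical function name, not the search procedure used in the cited studies.

```python
def find_palindromes(sequence, min_len=3):
    """Return (start_index, palindrome) for every substring of at least
    `min_len` residues that reads the same in both directions."""
    hits = []
    n = len(sequence)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            segment = sequence[start:end]
            if segment == segment[::-1]:
                hits.append((start, segment))
    return hits


example_hits = find_palindromes("MKAKL")
print(example_hits)
```

Overlapping and nested palindromes are all reported, so a residue can belong to more than one hit.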

In the present study, we have analyzed certain sequence and structural properties associated with palindromes in proteins, such as the probability of amino acid residue occurrence at individual positions in palindrome sequences of a specific length, secondary structure conformation, hydrophobicity, solvent accessibility, residue neighborhood contacts, and interaction with catalytic site or active site residues, ligands or metals in proteins, and we have identified the protein families comprising the palindromes. We discuss these features for pentapeptide and large palindromes identified in representative proteins of known three-dimensional structure. Further, we examine, for certain illustrative examples, the ‘environment’ of palindromes characterized by the same length, sequence and secondary structure in different proteins.


Results

Dataset of bound and unbound structures

The dataset consists of 126 protein-RNA complexes for which at least one interacting partner is available in the unbound form. Of these 126 complexes, 28 are in class A, 5 are in class B, 40 are in class C and 53 are in class D (refer to Materials and Methods section and Supplementary Table S1). Based on their availability in the unbound form, we find 21 are of PURU type, where both the protein and the RNA are available in the unbound form, 95 are of PURB type, where only the protein is available in the unbound form, and 10 are of PBRU type, where only the RNA is available in the unbound form (Table 1). Local alignment of the polypeptide chains between the unbound and the bound (U/B) structures reveals that 93 out of 116 have sequence identity >98%, while the rest have values between 90% and 98%. On the other hand, 20 out of 31 U/B pairs of polynucleotides have sequence identity >98%, and the rest have values between 90% and 98%. We have discarded 896 (6.7%) nucleotides in the entire dataset due to mismatches in the alignment between U/B pairs.

Change in accessibility at the protein-RNA interfaces

The overall change in accessibility upon binding is a cumulative effect of many local conformational rearrangements. Some residues get exposed by burying others, or vice versa. The change in accessibility of the interface atoms upon binding was calculated by comparing their SASA values in the bound and unbound states. On average, RBPs gain 120.5 Å² of solvent accessibility at the interface upon binding with RNA (Table 1). In 92 out of 116 cases, the interface region of the RBP gains accessibility upon binding, with an average δA (refer to Materials and Methods section) of −172.0 Å². In the remaining 24 cases, positive changes in δA are observed, with an average of 77.2 Å², indicating a loss in accessibility at the interface. On average, the interface region of RNAs gains 92.5 Å² of solvent accessibility upon binding with RBPs. The majority of them, 80% (25 out of 31), show negative δA with an average of −144.1 Å² (Table 1). The remaining 20% show positive changes with an average of 122.7 Å², indicating a loss of accessibility.

Distribution of change in accessibility in RBPs and RNAs upon binding. (A) Correlation between |∆AP| and |∆AR| at the protein-RNA interfaces for 21 UU cases. The different classes of complexes are shown in different symbols. Distributions of δA in 116 RBPs and in 31 RNAs at the protein-RNA (B) interface and (C) non-interface regions.

Change in accessibility at the non-interface region

We have estimated the change in accessibility of amino acid residues and nucleotides at the non-interface region. Here, the average change in accessibility of RBPs is only 3.4 Å² (Table 1), which is significantly lower than that of the interface region. In the entire dataset, 50% of RBPs show negative changes with an average of −24.6 Å², and 50% show positive changes with an average of 30.4 Å². At the non-interface region of the RNA, the average change in accessibility is 40.3 Å². In the entire dataset, the majority (71%) of the RNAs lose accessibility upon binding, with an average δA of 67.3 Å². Only nine RNAs (29%) show negative changes, with an average of −25.6 Å².

Distribution of δA in main chain and side chain calculated on 116 RBPs (A), and in phosphate, sugar and bases calculated on 31 RNAs (B). The average values are presented for Buried (Bu) and Exposed (Ex) surfaces in interface and in non-interface regions of different class of complexes.

Effect of conformational change on accessibility

Conformational changes between unbound and bound forms are estimated in terms of i-rmsd, which is the root mean squared deviation of interface Cα and P atoms of amino acids and nucleotides, respectively. Based on the degree of conformational change, protein-RNA binding can be classified into rigid body (i-rmsd < 1.5 Å), semi-flexible (i-rmsd within 1.5 Å to 3.0 Å) and full flexible (i-rmsd > 3.0 Å) [11,13]. Although the average δA is −96 Å² and −100.4 Å² for rigid-body and semi-flexible bindings, respectively, the change is significantly higher (−248 Å²) for full flexible binding. We find a moderate correlation (R = 0.6) between δA and i-rmsd. Besides, we also find that the change in interface accessibility is significantly contributed by the side chain conformations (Fig. 2A), which are ignored in the i-rmsd calculation. This is exemplified in Fig. 3A,B, where the tRNA splicing endonuclease undergoes rigid body association (i-rmsd is 1.0 Å), yet its interface shows a significant change in accessibility (δA is −410.7 Å²) upon binding with its partner RNA. Here, the side chain (δA is −356 Å²) accounts for a larger change in accessibility than the main chain (δA is −54.6 Å²). Counter examples are also observed, where a small change in interface accessibility does not correlate with the high i-rmsd value. This is exemplified by ribosomal L1 protein, which undergoes a significant conformational change (i-rmsd is 5.1 Å) upon binding with its partner RNA even though the change in accessibility is only −2.2 Å². The N- and C-terminal domains of L1 are linked by a short and a long loop (Fig. 3C). In the unbound form, the buried surface area between these two domains is very small. Upon binding with RNA, the long loop acts as a hinge and moves both domains apart to facilitate the RNA binding. This domain movement leads to a higher i-rmsd without affecting the overall change in accessibility. Similarly, changes in accessibility may also be attributed to the backbone as well as to the conformational changes of the sugar and bases of RNA. For instance, E. coli Ras-like protein (ERA), which acts as a chaperone for folding and maturation of 16S rRNA, induces a large conformational change in the 12-nucleotide-long 3′-end of 16S rRNA. The RNA adopts a Z-like structure upon binding with the KH domain of ERA [14], and the estimated δA is −311.7 Å². The second U from the 5′-end of the 12-nucleotide sequence changes the conformation of the base (anti-to-syn) and the sugar pucker (C2′-endo-to-C3′-endo), and contributes a −96.5 Å² change in accessibility (Fig. 3D).
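The i-rmsd thresholds quoted in this section can be sketched as follows. The code assumes the two coordinate sets contain the same interface atoms in the same order and have already been superposed; the superposition step is not shown, and the function names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two matched, pre-superposed
    lists of (x, y, z) coordinates in Å."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def binding_class(i_rmsd):
    """Classify binding by i-rmsd using the thresholds quoted in the text."""
    if i_rmsd < 1.5:
        return "rigid body"
    if i_rmsd <= 3.0:
        return "semi-flexible"
    return "full flexible"


print(binding_class(1.0))  # tRNA splicing endonuclease example
print(binding_class(5.1))  # ribosomal protein L1 example
```

As the endonuclease and L1 examples show, this class alone does not predict the size of the accessibility change, because side-chain rearrangements are invisible to an rmsd computed over Cα and P atoms.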

Change in accessibility on local and global conformational change. (A) Superposed structures of RNA splicing endonuclease in bound [42] (in orange, PDB id: 2GJW) and in unbound [43] (in cyan, PDB id: 1R0V) conformations with the RNA (shown in grey). The Arg-nucleotide-Arg sandwich at the cleavage site of the nuclease is shown. Both Arg residues are labeled and shown as sticks. The change in conformation of R302 allows A13 (in blue) to protrude into the endonuclease pocket and be stacked by the two Arg residues. (B) Top view of the aforementioned structure in sphere representation. Both Arg residues are shown in yellow. In the unbound structure, the endonuclease pocket is not accessible to the nucleotide. The change in conformation of R302 makes the pocket more accessible. (C) Unbound [44] (PDB id: 1AD2) and bound [45] (PDB id: 2HW8) structures of ribosomal protein L1 (in cyan). The loop at the hinge region connecting the two domains is colored in red. The RNA molecule in the bound structure is shown in grey cartoon. (D) Superposed structures of unbound (PDB id: 1SDR, in yellow) and bound (PDB id: 3IEV, in grey) forms of the 12-nucleotide-long 3′-end of 16S rRNA with ERA. The protein is represented in orange cartoon.

Changes in secondary structural elements in RBPs upon binding

Conformational changes can alter secondary structures during the unbound-to-bound transition, leading to changes in accessibility. Figure 4A shows the average |ΔAP| for different types of transitions in the secondary structural elements upon binding. We find that the average change in accessibility at the interface is highest (|ΔAP| = 47.5 Å²) in loop-to-helix transitions, followed by helix-to-loop (|ΔAP| = 41 Å²) and loop-to-sheet (|ΔAP| = 38.4 Å²) transitions. Figure 4B shows an example of a loop-to-helix transition, where the α1-helix of L25 protein, unstructured in the unbound state (PDB id: 1B75), adopts the helical conformation upon binding with the major groove of 5S rRNA (PDB id: 1DFU) [15]. The α1-helix loses 230 Å² upon binding with its partner RNA. We did not find any transition from helix-to-sheet or vice versa at the interface.

Changes in |ΔA| due to the transitions of secondary structural elements in RBPs upon binding with RNA. (A) Average |ΔA| calculated per transition is presented for both interface and non-interface regions. (B) A loop-to-helix transition. Here, the α1-helix of L25 (Lys14 to Ala23, coloured in red) is unstructured in the unbound state (PDB id: 1B75) and adopts a helical conformation upon binding with the major groove of 5S rRNA (PDB id: 1DFU). (C) A sheet-to-helix transition. Here, Arg57 and Ala58 (shown in red stick) of translation elongation factor EF-Tu are in sheet conformation in the unbound state (PDB id: 1TUI) and adopt helical conformations upon binding with the tRNA(Cys) (PDB id: 1B23). (D) Another example of a sheet-to-helix transition. Here, Ala85 and Val86 (shown in red stick) of the CCA-adding enzyme are in β-sheet conformations in the unbound state (PDB id: 1UET) of the enzyme and adopt α-helical conformations upon binding with the tRNA (PDB id: 2DRB). In all these figures, the protein in bound and unbound states is shown in orange and teal, respectively, and the RNA is shown in grey.

At the non-interface region, the highest change in accessibility is observed in sheet-to-helix transitions (|ΔAP| = 64.9 Å²). This change is observed in the following four residues from two different RBPs. Two residues, Arg57 and Ala58 in translation elongation factor EF-Tu (PDB id: 1TUI), undergo sheet-to-helix transitions upon binding with the tRNA(Cys) (PDB id: 1B23) (Fig. 4C). The other two residues, Ala85 and Val86 in the unbound state of the CCA-adding enzyme (PDB id: 1UET), undergo sheet-to-helix transitions upon binding with the tRNA (PDB id: 2DRB) (Fig. 4D). Loop-to-helix transitions also contribute significantly to the change in accessibility (average |ΔAP| = 34.3 Å²) at the non-interface regions, whereas transitions from helix-to-loop or loop-to-sheet contribute moderately.

The effect of intermolecular H-bonds on accessibility

We evaluate the effect of intermolecular H-bonds on the change in solvent accessibility of amino acid residues and nucleotides at the protein-RNA interfaces. We find that the change in accessibility is larger for residues that are not involved in any H-bond with the partner nucleotides across the interface than for those involved in H-bonds (Fig. 5A). This trend is observed in the entire dataset as well as among the different classes. The average |δAP| is 61.3 Å² for residues involved in H-bonds across the interface, whereas those that do not participate in H-bonds have an average of 93 Å².

Distribution of δA in main chain and side chain calculated on 116 RBPs (A), and in phosphate, sugar and bases calculated on 31 RNAs (B). The average values are presented for Buried (Bu) and Exposed (Ex) surfaces of different class of complexes. Values for both H-bond (HB) and non-H-bond (Non HB) residues are given. Propensities of (C) amino acid residues and (D) nucleotides to get exposed or buried upon binding.

On the RNA side, the change in accessibility is significantly higher for nucleotides that are not involved in any H-bond compared to those involved in H-bonds across the interface (Fig. 5B). This phenomenon is observed in the entire dataset as well as among the different classes. Interestingly, different trends are observed in |δAR| among phosphate, sugar and bases. Among those involved in H-bonds across the interface, the highest average |δAR| is observed in bases (38.3 Å²), followed by phosphate (32.5 Å²) and sugar (14.4 Å²). In contrast, among those that do not participate in any H-bond across the interface, the highest average |δAR| is observed in bases (183.7 Å²), followed by sugar (163 Å²) and phosphate (83.5 Å²).

Accessibility of residues and nucleotides upon binding

The propensity of amino acid residues to get buried or exposed upon binding is shown in Fig. 5C. A positive propensity signifies that the residue prefers to get exposed upon binding, while a negative propensity indicates a preference to get buried. Among the positively charged residues, Arg shows little preference to get buried both at the interface and at the non-interface regions, while Lys shows the opposite trend in both regions. Among the negatively charged residues, Asp shows a strong preference to get buried at the interface, while Glu shows a similar preference at the non-interface region, but to a lesser extent. Between Asn and Gln, the former shows a preference to get exposed only at the non-interface region, while the latter shows a preference to get buried both at the interface and at the non-interface regions. Among the neutral polar residues, His and Thr prefer to get exposed, whereas Ser prefers to get buried both at the interface and at the non-interface regions. Among the three aromatic residues, Tyr and Phe both prefer to get exposed at the interface, with different magnitudes, while Trp prefers to get buried at the interface and exposed at the non-interface region. Both sulphur-containing residues, Cys and Met, prefer to get buried both at the interface and at the non-interface regions, although the former has a stronger preference than the latter. Among the hydrophobic residues, Leu, Val and Ala prefer to get exposed both at the interface and at the non-interface regions, while Gly prefers to get exposed only at the interface. On the contrary, Pro prefers to get buried both at the interface and at the non-interface regions. Ile behaves differently: it prefers to get buried at the interface and exposed at the non-interface regions.

Among the four nucleotides, adenine and cytosine prefer to get buried at the interface and get exposed at the non-interface regions. Guanine prefers to get buried, while uracil prefers to get exposed both at the interface and at the non-interface regions (Fig. 5D).

Change in SASA can be used as a parameter to score protein-RNA decoys

Binding-induced conformational transitions lead to changes in the SASA of individual atoms in the interacting subunits. Some atoms gain accessible surface and some lose it. We find that the average gain to loss ratio of accessible surface area (GL ratio) upon binding is 1.7 at the interface and 1.0 at the non-interface regions (p-value = 1.6E-04, single tailed t-test). In the majority of cases, the ratio is close to one at the non-interface region. This ratio has never been used in any available protein-RNA docking algorithm [16], and may be efficiently used to score flexible docking models to identify the near-native solution. Figure 6A and 6B show the distribution of the GL ratio in 115 RBPs and in 31 RNAs, respectively. The highest GL ratio (18.7) is found in the structure of iron regulatory protein 1 (IRP1) in complex with ferritin H IRE RNA (PDB id: 3SNP). This high ratio can be attributed to the large conformational change in IRP1 upon binding with the RNA, which is facilitated by a major rearrangement of the two domains of IRP1 [17] (Fig. 6C), gaining 1279 Å² accessibility at the interface. The lowest GL ratio (0.5) is observed in the complex between poly(A) polymerase and oligo(A) RNA (PDB id: 2Q66). In the polymerase, the catalytic site is located at the bottom of the cleft between the N- and C-terminal domains [18]. In the unbound state, both domains of the polymerase remain in an open conformation, and they adopt a closed conformation upon binding with the RNA, thereby losing 163.6 Å² of surface area at the interface (Fig. 6D). The highest GL ratio (2.8) at the RNA binding surface is observed in the T-arm analogue RNA segment (PDB id: 1EVV) in complex with 5-methyluridine methyltransferase TrmA (PDB id: 3BT7). In the unbound state, U54 remains buried inside the T-loop of the tRNA and forms a reverse-Hoogsteen base pair with A58 [19]. In the bound state, the loop changes its conformation and U54 flips out towards the active site of the enzyme, thereby gaining a surface accessibility of 310.4 Å² (Fig. 6E).
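A rough sketch of how a gain to loss ratio could be computed from per-atom SASA values is given below. The exact definition is in the Materials and Methods section, which is not reproduced here, so this assumes the ratio is the summed SASA gained by atoms that become more exposed divided by the summed SASA lost by atoms that become more buried. The function name and the atom labels are hypothetical.

```python
def gain_loss_ratio(unbound_sasa, bound_sasa):
    """Assumed GL ratio: total SASA gained / total SASA lost on binding.

    Both arguments map an atom identifier to its SASA in Å²; only atoms
    present in both states are compared.
    """
    gain = 0.0
    loss = 0.0
    for atom, unbound in unbound_sasa.items():
        if atom not in bound_sasa:
            continue
        change = bound_sasa[atom] - unbound
        if change > 0:
            gain += change
        else:
            loss -= change
    if loss == 0:
        raise ValueError("no atom loses accessibility; ratio is undefined")
    return gain / loss


# Toy values, not taken from any structure in the dataset.
unbound = {"ARG302:NH1": 10.0, "ARG302:NH2": 20.0, "LYS14:NZ": 5.0}
bound = {"ARG302:NH1": 27.0, "ARG302:NH2": 10.0, "LYS14:NZ": 5.0}
ratio = gain_loss_ratio(unbound, bound)
print(f"GL ratio = {ratio:.2f}")
```

A ratio above one means the region gains more accessible surface than it loses, which is the behaviour the text reports on average at interfaces.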

Gain or loss in accessibility. (A) The distribution of GL ratio of RBPs at interface and non-interface regions. (B) The distribution of GL ratio of RNAs at interface and non-interface regions. (C) In the unbound state of IRP1(PDB id: 2B3Y), domain 3 and 4 are in closed conformation, which transformed into open conformation upon binding with the RNA (PDB id: 3SNP). Both the domains move apart (bidirectional arrow), thereby increasing significant amount of surface to accommodate the RNA. Domain 3 and 4 are colored in blue and orange respectively, and the rest of the protein is colored in teal. (D) Example of “open-to-close” conformation change in poly(A) polymerase and oligo(A) RNA complex (PDB id: 2Q66). In the unbound state (color teal PDB id: 2HHP), the binding cleft between N- and C-terminal domains remain wide open, which transformed into closed conformation upon binding with the RNA, hence losing the accessibility. (E) Superposed T-arm analogue RNA segment in bound (in grey PDB id: 3BT7) and in unbound (in yellow PDB id: 1EVV) states. The U54 (in magenta) in the unbound state remains inside the loop, which flips out to the active site upon binding with the 5-methyluridine methyltransferase TrmA (shown in orange).


3 Results

3.1 Features

We used a series of features to set up the SOLart solubility predictor, which are described below.

3.1.1 Statistical potentials

We applied and extended the solubility-dependent statistical potentials recently introduced in Hou et al. (2018), which have proven to yield an objective and informative description of the interactions that modulate protein solubility properties. The idea was to divide the dataset D_E.coli into two subsets of equal size, called D_E.coli^insol and D_E.coli^sol, which contain aggregation-prone and soluble proteins, respectively, and to derive distance potentials from each of the two subsets (see Hou et al., 2018 for details). In this way, we defined two distinct potentials referred to as ‘insoluble’ and ‘soluble’.

Analysis of these potentials revealed the tendency of certain amino acid interactions, such as Lys-containing salt bridges and aliphatic interactions, to favor protein solubility. In contrast, residue interactions involving delocalized π-electrons, such as aromatic and cation-π interactions, have been shown to promote protein aggregation (Hou et al., 2018).

We constructed 11 solubility-dependent statistical potentials from different combinations of s and c elements, listed in Table 2. We named the potentials according to the type and number of sequence and structure descriptors. For example, ‘sa’ represents the potential in which one amino acid type and one solvent accessibility are specified, whereas ‘sds’ describes the potential in which two amino acid types and their inter-residue distance are given.

List of all the features tested for SOLart

Features | Description | SOLart
Statistical potentials
sd: ΔΔG_sd | 1 amino acid, 1 distance | ✓✓
sds: ΔΔG_sds | 2 amino acids, 1 distance | ✓✓
sa: ΔΔG_sa | 1 amino acid, 1 solvent accessibility | ✓✓
saa: ΔΔG_saa | 1 amino acid, 2 solvent accessibilities | ✓✓
ssa: ΔΔG_ssa | 2 amino acids, 1 solvent accessibility | ✓✓
st: ΔΔG_st | 1 amino acid, 1 torsion angle domain | ✓✓
stt: ΔΔG_stt | 1 amino acid, 2 torsion angle domains | ✓✓
sst: ΔΔG_sst | 2 amino acids, 1 torsion angle domain | ✓✓
sad: ΔΔG_sad | 1 amino acid, 1 distance and 1 solvent accessibility | ✓✓
std: ΔΔG_std | 1 amino acid, 1 distance and 1 torsion angle domain | ✓✓
sta: ΔΔG_sta | 1 amino acid, 1 torsion angle domain and 1 solvent accessibility | ✓✓
Protein size and solvent accessible surface area
Λ | protein length | ✓✓
SAcc | protein solvent accessibility | ✓✓
SAcc/Λ | protein solvent accessibility divided by length | ✓✓
Secondary structure content
β_b | fraction of buried β residues | ✓✓
β_m | fraction of moderately buried β residues | ✓✓
β_e | fraction of exposed β residues |
α_b | fraction of buried α residues |
α_m | fraction of moderately buried α residues | ✓✓
α_e | fraction of exposed α residues | ✓✓
γ_b | fraction of buried coil residues |
γ_m | fraction of moderately buried coil residues |
γ_e | fraction of exposed coil residues |
Amino acid composition
C_i (i = 1..20) | fraction of each of the 20 amino acid types |
K+R | fraction of positively charged residues |
K−R | fraction of K minus fraction of R | ✓✓
D+E | fraction of negatively charged residues | ✓✓
D−E | fraction of D minus fraction of E |
K+R+D+E | fraction of charged residues | ✓✓
K+R−D−E | fraction of positively minus negatively charged residues | ✓✓
F+W+Y | fraction of aromatic residues | ✓✓
_b, m, e | idem with distinction between buried, moderately buried and exposed residues |

Note: Those used in the final version are marked by a ✓✓; those for which a subset is used are marked by a .

List of all the features tested for SOLart

Features . Description . SOLart .
Statistical potentials
sd: Δ Δ G sd 1 amino acid, 1 distance ✓✓
sds: Δ Δ G sds 2 amino acids, 1 distance ✓✓
sa: Δ Δ G sa 1 amino acid, 1 solvent accessibility ✓✓
saa: Δ Δ G saa 1 amino acid, 2 solvent accessibilities ✓✓
ssa: Δ Δ G ssa 2 amino acids, 1 solvent accessibility ✓✓
st: Δ Δ G st 1 amino acid, 1 torsion angle domain ✓✓
stt: Δ Δ G stt 1 amino acid, 2 torsion angle domains ✓✓
sst: Δ Δ G sst 2 amino acids, 1 torsion angle domain ✓✓
sad: Δ Δ G sad 1 amino acid, 1 distance and 1 solvent accessibility ✓✓
std: Δ Δ G std 1 amino acid, 1 distance and 1 torsion angle domain ✓✓
sta: Δ Δ G sta 1 amino acid, 1 distance and 1 solvent accessibility ✓✓
Protein size and solvent accessible surface area
Λ protein length ✓✓
SAcc protein solvent accessibility ✓✓
SAcc/Λ protein solvent accessibility divided by length ✓✓
Secondary structure content
β _b fraction of buried β residues ✓✓
β_m fraction of moderately buried β residues ✓✓
β_e fraction of exposed β residues
α_b fraction of buried α residues
α_m fraction of moderately buried α residues ✓✓
α_e fraction of exposed α residues ✓✓
γ_b fraction of buried coil residues
γ_m fraction of moderately buried coil residues
γ_e fraction of exposed coil residues
Amino acid composition
C i ( i = 1..20 ) fraction of each of the 20 amino acid types
K+R fraction of positively charged residues
K−R fraction of K minus fraction of R ✓✓
D+E fraction of negatively charged residues ✓✓
D−E fraction of D minus fraction of E
K+R+D+E fraction of charged residues ✓✓
K+R-D-E fraction of positively minus negatively charged residues ✓✓
F+W+Y fraction of aromatic residues ✓✓
_b, m, e idem with distinction between buried, moderately buried and exposed residues

Note: Features used in the final version are marked by ✓✓; those for which only a subset is used carry a separate mark.

3.1.2 Protein size and accessible surface area

We considered three global characteristics of the proteins: the protein length (Λ), the solvent accessible surface area (SAcc) estimated with an in-house program (Dalkas et al., 2014), and the solvent accessible surface area divided by the protein length (SAcc/Λ); in the latter case, we used the length of the sequence whose structure has been determined. Note that the first feature is sequence-based, whereas the other two require knowledge of the 3D structure.

3.1.3 Secondary structure content

Another series of structure-based features were added, which are the fraction of protein residues that are in α-helical, β-strand or coil (called here γ) conformation. We distinguished between the α, β and γ residues that are buried in the protein core (solvent accessibility ≤ 20%), moderately buried (between 20% and 50%), and solvent exposed (≥ 50%). Our in-house program (Dalkas et al., 2014) was used to assign the secondary structure and solvent accessibility.
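The three-state split described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (≤ 20%, between 20% and 50%, ≥ 50%); the function names `classify_rsa` and `burial_fractions` are hypothetical and are not taken from the authors' in-house program.

```python
def classify_rsa(rsa_percent: float) -> str:
    """Map a relative solvent accessibility (in percent) to a burial class."""
    if rsa_percent <= 20.0:
        return "buried"
    if rsa_percent < 50.0:
        return "moderately buried"
    return "exposed"


def burial_fractions(rsa_values):
    """Fraction of residues falling in each burial class for one protein."""
    counts = {"buried": 0, "moderately buried": 0, "exposed": 0}
    for rsa in rsa_values:
        counts[classify_rsa(rsa)] += 1
    total = len(rsa_values)
    if total == 0:
        return {state: 0.0 for state in counts}
    return {state: n / total for state, n in counts.items()}
```

With these thresholds, a residue with an RSA of 23% counts as moderately buried, whereas under a two-state 20% cut-off the same residue would simply be called exposed.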

3.1.4 Amino acid composition

We integrated 20 purely sequence-based features, corresponding to the fraction of each of the 20 amino acids present in a protein. We also considered the fraction of amino acid groups, i.e. positively charged residues (K+R), negatively charged residues (D+E), charged residues (K+R+D+E) and aromatic residues (F+W+Y), as well as the difference between the fractions of K and R (K−R), D and E (D−E), and K+R and D+E (K+R−D−E). We combined these features with the solvent accessibility and defined three categories per amino acid or amino acid group, according to whether the residue is exposed, moderately buried or buried. This yielded 81 additional structure-based features.
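As a rough illustration of the sequence-based composition features listed above, the sketch below computes the 20 single-residue fractions and the seven group features from a one-letter sequence. The function name and the dictionary keys are assumptions chosen for readability, not the authors' code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence: str) -> dict:
    """Return the 20 amino acid fractions plus the 7 group features."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    frac = {aa: seq.count(aa) / n for aa in AMINO_ACIDS}
    features = dict(frac)
    features["K+R"] = frac["K"] + frac["R"]          # positively charged
    features["D+E"] = frac["D"] + frac["E"]          # negatively charged
    features["K+R+D+E"] = features["K+R"] + features["D+E"]
    features["F+W+Y"] = frac["F"] + frac["W"] + frac["Y"]  # aromatic
    features["K-R"] = frac["K"] - frac["R"]
    features["D-E"] = frac["D"] - frac["E"]
    features["K+R-D-E"] = features["K+R"] - features["D+E"]
    return features
```

These 27 values, together with the protein length, account for the 28 purely sequence-based features; restricting each of the 27 to buried, moderately buried or exposed residues gives the 3 × 27 = 81 structure-based variants.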

3.2 Feature selection

The next step consisted in selecting, out of the above-defined 28 purely sequence-based features and 103 structure-based features, the subset of features that are the most informative for protein solubility. We used for that purpose the D_E.coli training set, which contains 406 non-redundant high-resolution X-ray structures of E. coli proteins with low pairwise sequence identity and experimentally measured solubility (see Section 2.2). The feature selection was performed using the Boruta algorithm (Kursa et al., 2010) implemented in the Caret package of R (Kuhn et al., 2008), a wrapper built around the random forest classification algorithm (Liaw et al., 2002), which compares the importance of the real features with those of random (shadow) features using statistical testing. The results are obtained as an average over several runs (here 1000) of random forest.

We filtered out the features whose average importance measured by the Boruta algorithm is lower than 1. This led us to keep a total of 52 features, which are shown in Figure 1 and Supplementary Figure S2. Among these, 37 require the knowledge of the structure.

The top 30 most important features identified by feature selection, from left to right. The names in lowercase letters indicate folding free energy differences, e.g. sst means ΔΔG_sst

Strikingly, the four top-ranked features are folding free energy differences ΔΔG computed from our solubility-dependent potentials: the backbone torsion angle potential sst, the solvent accessibility potential ssa and the two distance potentials sd and sds (see Table 2). The next most important feature is the protein length Λ, followed by the solvent accessibility and fractions of some amino acid types. The features based on the secondary structure do not appear among the 30 top features, but some appear in the list of 52 selected features.

3.3 Setting up SOLart

The 52 selected features were combined to set up the SOLart predictor of the solubility of target proteins on the basis of their 3D structures. We used for that purpose D_E.coli as training set, and the random forest regression algorithm (Liaw et al., 2002) implemented in the Caret package to construct the model. This algorithm is a tree-based system composed of multiple regression trees; the number of trees is set here to 500. The training process starts with a randomly selected subset of the original dataset from which a regression tree is constructed by the iterative partitioning of the data space into smaller subsets. At each node of the tree, randomly sampled features are used; the number of features depends on a global parameter 'mtry', taken here between 1 and 52, the total number of features. The optimal mtry value is obtained through a grid search procedure. Its impact on the prediction performance is illustrated in Supplementary Figure S5. The regression for a target protein is obtained by averaging the predictions over all trees.

3.4 Performance of SOLart

As the prediction model is constructed on the basis of the selected features but also depends on the mtry parameter value, we performed nested 10-fold cross-validation to assess the performance of SOLart on the D_E.coli set, with an outer cross-validation loop and an inner cross-validation loop nested in the outer loop, as explained in Supplementary Section S4. A total of 30 replicates were performed for the outer loop cross-validation, with different random divisions into folds, and the performances were computed as averages over the replicates.
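The nested scheme can be sketched generically as follows. The details of the authors' protocol are in their Supplementary Section S4 and are not reproduced here; in this sketch, `fit_and_score` is a hypothetical user-supplied callback that trains a model with a given mtry on the training indices and returns an error on the held-out indices (lower is better).

```python
import random


def kfold(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n_samples, mtry_grid, fit_and_score, k=10, seed=0):
    """Return one (selected mtry, outer test error) pair per outer fold."""
    rng = random.Random(seed)
    outer_folds = kfold(range(n_samples), k, rng)
    results = []
    for i, test_idx in enumerate(outer_folds):
        # Everything outside the current outer fold is available for tuning.
        train_idx = [j for f, fold in enumerate(outer_folds) if f != i for j in fold]
        inner_folds = kfold(train_idx, k, rng)

        def inner_error(mtry):
            # Mean validation error of this mtry over the inner folds.
            errors = []
            for v, val_idx in enumerate(inner_folds):
                fit_idx = [j for f, fold in enumerate(inner_folds) if f != v for j in fold]
                errors.append(fit_and_score(fit_idx, val_idx, mtry))
            return sum(errors) / len(errors)

        best_mtry = min(mtry_grid, key=inner_error)
        # The outer test fold is touched only once, after mtry is fixed.
        results.append((best_mtry, fit_and_score(train_idx, test_idx, best_mtry)))
    return results
```

The point of the nesting is that the outer test fold never influences the choice of mtry, so the outer error is not biased by the grid search. Repeating the call with different `seed` values would correspond to the replicates with different random divisions into folds.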

Our computational model reaches a good linear correlation coefficient of r = 0.66 between the SOLart solubility predictions and the experimental values, and a root mean square error of RMSE = 25% (Table 3).
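For reference, the two scores quoted throughout this section can be computed as in the minimal sketch below. It uses the standard definitions of the Pearson correlation coefficient and the root mean square error; the function names are illustrative only.

```python
import math


def pearson_r(predicted, observed):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root mean square error, in the same units as the inputs."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
```

Because the solubility values are expressed in percent, the RMSE is also reported in percent.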

SOLart performances in cross-validation on the learning set D_E.coli, and on three independent test sets: D_S.cerevisiae containing X-ray structures, and M_E.coli and M_S.cerevisiae containing modeled structures

|      | D_E.coli | M_E.coli    | D_S.cerevisiae | M_S.cerevisiae |
|------|----------|-------------|----------------|----------------|
| r    | 0.66     | 0.51 (0.67) | 0.67 (0.78)    | 0.63 (0.70)    |
| RMSE | 25%      | 28% (23%)   | 23% (19%)      | 24% (20%)      |

Note: The values in parentheses correspond to the performance with 10% outliers removed.

We also tested SOLart on an independent test set that contains S. cerevisiae proteins with a well resolved X-ray structure, grouped in the D_S.cerevisiae set (see Section 2.2). The performance of SOLart on this set is evaluated by a linear correlation coefficient r = 0.67 and an RMSE = 23%. When 10% outliers are removed, the score increases up to r = 0.78 and RMSE = 19% (Table 3). The scores on this independent set are thus even slightly better than those obtained in cross-validation on the training set D_E.coli.

To further analyze this result, we estimated the importance of each feature in the SOLart prediction using the varImp permutation scheme-based function ( Kuhn et al., 2008). It proceeds by randomly permuting each feature in turn in order to break its association with the response, and then using it together with the remaining unpermuted features for prediction. The decrease of the prediction accuracy is a measure of the importance of the permuted feature. This measure estimates the weight of each individual feature in the predictor, whereas the feature selection algorithm applied in Section 3.2 measures the feature relevance independently of the prediction model. They thus yield slightly different rankings.
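The permutation idea described above can be illustrated independently of any particular library. The sketch below is a generic Python version, not the varImp implementation: `predict` is any function mapping a feature row to a prediction, and importance is measured as the mean increase in mean squared error after shuffling one feature column.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of a predictor over a list of feature rows."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the response
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(predict, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature that the model ignores gets an importance of exactly zero, since shuffling it leaves every prediction unchanged.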

The 20 most important features of our prediction model are shown in Figure 2 (see also Supplementary Fig. S3). Interestingly, almost all the features that correspond to folding free energy differences (ΔΔG) are in this list (9 out of 11), and the six top features are the ΔΔG values computed from the potentials ssa, sst, sd, sds, saa and sa (Table 2). The two best ones, almost ex æquo, are ΔΔG_ssa and ΔΔG_sst, which also ranked first in the feature selection (Fig. 1). They are computed from the propensities of amino acid pairs to be associated with a certain solvent accessibility range a or a certain backbone torsion angle domain t of a residue. These propensities differ between soluble and aggregation-prone proteins, and it is this difference which is measured through the ΔΔG features. The next best ranked features are ΔΔG_sd and ΔΔG_sds, computed from the propensities of residue pairs to be separated by a certain spatial distance, followed by two other accessibility potentials, ΔΔG_saa and ΔΔG_sa.

The top 20 most important features of SOLart, from right to left. The names in lowercase letters indicate folding free energy differences, e.g. sst means ΔΔG_sst

These folding free energy features require the protein structure as input. In fact, more than half of the top 20 features are structure-based, which confirms the relevance of structural information for determining protein solubility properties. The first sequence-based feature ranks seventh. It is the sequence length Λ: in general, the shorter the sequence, the more soluble the protein (Kramer et al., 2012). The two related features, i.e. the solvent accessible surface area SAcc divided or not by the length, are also among the top 20 features.

The remaining features in the top 20 are sequence-based: the difference between Lys and Arg composition (K−R), which is positively correlated with solubility (Hou et al., 2018; Warwicker et al., 2014), the percentage of aromatic residues (F+W+Y), which favor aggregation (Hou et al., 2018; Niwa et al., 2009), and the total fraction of negatively charged residues (D+E), which have also been shown to promote solubility (Hou et al., 2018; Niwa et al., 2009). The next features are the composition in R and Q, which disfavors solubility, the composition in E and K, which instead promotes solubility, and the difference between the fraction of positively and negatively charged residues (K+R−D−E), which augments insolubility.

Note that all these sequence-based features have also been employed by the solubility predictors available in the literature. However, in addition to these commonly used features, we utilized a series of structure-based features among which the most important ones are obtained from the newly developed solubility-dependent statistical potentials. These capture the solubility properties in a more accurate way and represent the key instrument of our approach.

To further check the importance of considering the 3D structure, we trained a prediction model on the 28 sequence features considered here. As shown in Supplementary Table S2, this model has a score of r = 0.59 in nested cross-validation on the D_E.coli set, which is about 12% lower than the SOLart score of r = 0.66.

3.5 Performance on modeled protein structures

SOLart has been shown to be accurate when the 3D structure of the target protein is known. To enlarge its applicability, we tested it on low-resolution structures obtained via homology modeling. We first applied it to the M_E.coli dataset containing 550 proteins from E. coli (see Section 2.2). We obtained a correlation of r = 0.51 and an RMSE of 28%, which is relatively good but lower than the performance on D_E.coli (Table 3). This drop is expected, since the possible inaccuracies in the modeled structures add to the error of our computational method. After removing 10% outliers, the performance increases to r = 0.67 and RMSE = 23%, and thus reaches the same performance as on good-resolution structures.

As a last test set, we used M_S.cerevisiae, which contains S. cerevisiae proteins with modeled structures. The performance of SOLart on this set is given by r = 0.63 and RMSE = 24%, and increases up to r = 0.70 and RMSE = 20% when 10% outliers are removed. The scores are thus much higher on this test set than on the E. coli test set, which suggests that some structural protein models or experimental solubility values might be less accurate in the E. coli set than in the S. cerevisiae set.

Note that these tests are quite strict, since there is a low sequence similarity (≤25%) between these test sets and the training set. We thus conclude that SOLart can reliably be used to predict solubility not only for high-resolution experimental structures but also for modeled or other low-resolution structures.

3.6 Comparison with other solubility prediction methods

The performance of SOLart was compared with that of other solubility prediction methods on the combination of the D_S.cerevisiae and M_S.cerevisiae sets, which group X-ray and modeled structures from S. cerevisiae proteins, as these are independent test sets that are not included in the training sets of any of the predictors. More precisely, we tested the methods Protein-Sol (Hebditch et al., 2017), ccSOL (Agostini et al., 2014), CamSol (Sormanni et al., 2015), PROSO (Smialowski et al., 2007), PROSO II (Smialowski et al., 2012), Aggrescan3D 2.0 (Kuriata et al., 2019), DeepSol (Khurana et al., 2018), PaRSnIP (Rawi et al., 2018) and SOLpro (Magnan et al., 2009), by submitting to their respective webservers all the proteins from our test datasets or by installing their programs locally. Note that these methods are all sequence-based with the exception of Aggrescan3D 2.0.

The linear correlation coefficients r between the solubility predictions and the experimental values for all these predictors are given in Table 4. Our method clearly outperforms the competitors (r = 0.65 against r = 0.55 for the second best method). This demonstrates the importance of using structural information.

Comparison of the performance of different predictors on the combination of the D_S.cerevisiae and M_S.cerevisiae test sets, on the basis of the Pearson correlation coefficient between predicted and experimental solubility values

| Predictor       | r    |
|-----------------|------|
| SOLart          | 0.65 |
| ccSOL           | 0.55 |
| Protein-Sol     | 0.53 |
| CamSol          | 0.40 |
| Aggrescan3D 2.0 | 0.36 |
| DeepSol         | 0.30 |
| PROSO           | 0.28 |
| SOLpro          | 0.18 |
| PROSO II        | 0.12 |
| PaRSnIP         | 0.09 |

3.7 Webserver

We provided a freely available webserver interface for our prediction method, which targets non-expert users (http://babylone.ulb.ac.be/SOLART/index.php) (Fig. 3). The input consists of the 3D structure of the target protein in PDB format. It can be uploaded directly by the user or imported from the PDB (Berman et al., 2000) by typing its four-letter code. The webserver then provides a brief summary of some of the protein's characteristics and allows the user to choose one of the protein chains. The computation starts after the query submission. All the structure-based free energy, secondary structure and solvent accessibility features are first computed and then integrated with the other, sequence-based, features.

The webserver interface of SOLart

In the output page, reached by following the link provided, the value of the predicted scaled solubility S is given. If the score is close to zero, the target protein is predicted as aggregation-prone and, when it is close to 130, as soluble. Moreover, to have an indication of the contribution of each single feature to the solubility prediction of the target protein, we also show a figure with the solubility predicted from each feature taken individually and with SOLart. The prediction with each single feature is computed from a random forest model trained on the experimental solubility values of the D_E.coli set. This figure can be used as a source of inspiration to suggest the characteristics to modify in view of modulating solubility. An example is shown in Figure 4 for an acyltransferase from E. coli.

Predicted solubility of an example protein (PDB code 2qia, Uniprot code P0A722) with all features used in SOLart (horizontal line) or with each single feature only (histogram bars)

Due to its simplicity of use, we expect that this webserver will be of interest for researchers in academia and industry who are interested in modulating protein solubility without needing any prior bioinformatic knowledge.


UCLA MBI - SERp Server: Introduction

The aim of this tool is to suggest mutation candidates that are likely to enhance a protein's crystallizability via the generation of crystal contacts by the Surface Entropy Reduction (SER) approach described by Derewenda (2004).

Derewenda argues that crystallizability is associated with surface properties of the proteins and that globular proteins recalcitrant to crystallization contain on their surface an "entropic shield", made up of long, flexible polar side chains that impede the protein's ability to form intermolecular contacts and thus to assemble into a crystalline lattice. Crystallization is driven by the free energy change from the supersaturated solution of protein to protein crystals in the solvent. Given that the enthalpy values of intermolecular interactions in the crystal lattice are typically small, crystallization is very sensitive to entropy changes involving both the solvent and the protein. Incorporation of protein molecules into the lattice carries a negative entropy term, and this is an inescapable thermodynamic cost. Furthermore, immobilization of side chains and solvent at the point of crystal contacts generates additional loss of entropy.

The Surface Entropy Reduction approach involves the replacement of surface exposed, high entropy amino acids with residues that have small, low entropy side chains such as alanines. Lysines and glutamates are of particular importance, since statistical analyses show that both types of residues are localized predominantly on the surface (Baud and Karlin, 1999) and are disfavored at protein-protein interfaces (Conte et al., 1999).

Job Submission

  • Amino acid or DNA sequence to be analyzed
  • A short sequence name identifier (primarily for the user's convenience)
  • An email address for results delivery

Initial processing typically takes a few minutes. The user will be notified by email upon completion; current job and queue status are shown on the web page. Subsequent job parameter revisions take only a few seconds and are processed on demand.

Process Summary

The submitted sequence undergoes the following three primary analyses. Each analysis assigns either a positive or negative score to every residue in the sequence. Combined, these analyses identify the residues most favorable for mutation. A positive contribution from every model is not required, although higher positive scores indicate better candidates.

    Secondary structure prediction
    The secondary structure is predicted with PSIPRED, which incorporates two feed-forward neural networks that analyze the output obtained from PSI-BLAST. Predicted coil regions are marked as favorable sites for mutation, as they tend to be surface exposed and have so far proved very effective; the entropy reduction concept was found to be less effective if the targeted patch lies on the solvent-exposed face of a helix.
    The score contribution from the secondary structure analysis is directly proportional to the confidence for a residue to be in a coil region. A graph showing the secondary structure confidences is provided on the Graphs tab.

  • Prefer residues that scored favorably in the primary analyses.
  • Maximize length of low entropy patch post mutation.
  • Minimize gaps in the low entropy patch.
  • Minimize number of required mutations.
  • Maximize side chain entropy reduction.

All proposed mutations within a cluster need to be introduced concurrently to ensure sufficient removal of the "entropy shield." By default a cluster will contain no more than three mutations to limit the reduction of the target protein solubility. Typically mutations from only one cluster are introduced into the protein target at a time, although larger proteins (>80 kD) may require concurrent mutation of several clusters. The protein target is often found to crystallize in new space groups, with mutated patches directly involved in new crystal contacts.
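To make the cluster idea concrete, here is a deliberately simplified Python sketch. It is not the server's algorithm: it ignores the secondary structure, entropy profile and conservation scores, and simply scans a sequence for nearby lysines and glutamates. The window of 10 residues, the restriction to K and E, the minimum of two sites per cluster and the function name are all assumptions made for illustration; only the cap of three mutations per cluster and the alanine replacement come from the text above.

```python
HIGH_ENTROPY = {"K", "E"}  # assumption: only lysine and glutamate are targeted


def propose_clusters(sequence, window=10, max_mutations=3, target="A"):
    """Return clusters as lists of (1-based position, old residue, new residue)."""
    seq = sequence.upper()
    clusters = []
    i = 0
    while i < len(seq):
        if seq[i] in HIGH_ENTROPY:
            # Collect high-entropy residues inside the window starting at i.
            end = min(i + window, len(seq))
            sites = [j for j in range(i, end) if seq[j] in HIGH_ENTROPY]
            sites = sites[:max_mutations]
            if len(sites) >= 2:
                clusters.append([(j + 1, seq[j], target) for j in sites])
                i = sites[-1] + 1  # continue after the last site of this cluster
                continue
        i += 1
    return clusters
```

For example, `propose_clusters("GGKEKGGGGGGGGGGGGE")` yields a single cluster with the substitutions K3A, E4A and K5A; the isolated glutamate at position 18 is left alone because it has no high-entropy neighbor within the window.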

Finally, a meta search is performed on the submitted sequence. This search attempts to detect other potential crystallization failure modes such as the requirement of metal ions or other small molecules, or interacting protein partners.

Results

The results are presented interactively on the website with internal links to analysis details as well as links to external sources. A condensed version of the results can also be delivered by email.

Summary Tab. The Summary tab contains a very brief synopsis of the proposed mutations. The mutations are proposed in groups or clusters, and all proposed mutations within a cluster should be introduced together. By default clusters are sorted by the prediction confidence, so the first returned cluster is expected to be the most successful in improving crystallization and/or diffraction quality for the provided sequence. The success confidence score is displayed as well; two clusters may have similar confidence scores, in which case either one or both proposals should be pursued independently.
Analysis details can be found on the Score Details tab. A graphical representation of the proposed mutation sites, secondary structure prediction and entropy profiles are on the Graphs Tab. Aligned sequences are on the Blast tab.

Score Details Tab. Score contributions making up the total score at each residue position can be found on this tab. A cluster is typically less than 10 amino acids in size and contains some non-mutable or non-high entropy amino acids. The patch of residues within a cluster that is predicted to be most successful is highlighted; proposed mutations are shaded green, and target residues are shaded yellow.

    SS Coil Confidence: Confidence in the range of 0 - 1.0 for a residue to be in a coil region, as predicted by PSIPRED.

Graphs Tab. The following graphs are provided to aid visualization of the proposed mutation sites, and to help understand the contribution of each analysis. Taken together, all analyses determine which sites are most suitable for mutation.

Overall Score: this stacked graph represents the score contribution from each analysis to the total score at each residue position. Refer to the legend on the Graphs tab. Peaks indicate regions that are predicted to contain the best mutation candidates to improve crystallization and/or diffraction quality.
Proposed clusters are highlighted and the cluster rank and score are shown. Residues proposed for mutation are shaded green.

A graphical representation of high entropy, mutable and low entropy target residues is shown at the bottom of this graph, both pre- and post-mutation.

    Blast Results: Number of sequences found by the PSI-BLAST search containing the same residue as the submitted sequence (conserved residue) and a target residue (mutated), respectively.

Blast Tab. Alignment results returned by PSI-BLAST. Top 50 (or fewer) alignments are shown, in default BLAST order by decreasing identity. The expectation value, bit score and sequence identity percentage to the provided sequence are shown for each alignment. A brief sequence annotation and an external link are provided.

For each proposed cluster, the residues in the aligned sequences are shown. A period indicates no change from the provided sequence. A gap in the aligned sequence is shown as '-'. An insertion in the aligned sequence is not shown. For convenience, high entropy amino acids are shown in red, and target amino acids in green.

The complete alignment and additional references (if any) are shown by clicking the expansion [+] link.

Meta Search Tab. Detailed results from the meta searches performed are shown on this tab.
Each BLAST-aligned sequence is screened for potential functional linkages. For each aligned sequence, potential matches are shown. Click the [+] expansion link to see all linkages, along with the detection method and confidence for each. Each linkage can be further examined on the ProLinks server using the provided link.



Properties of Organic Solvents

The values in the table below, except as noted, have been extracted from online and hardbound compilations. Values for relative polarity, eluant strength, threshold limits and vapor pressure have been extracted from: Christian Reichardt, Solvents and Solvent Effects in Organic Chemistry, Wiley-VCH Publishers, 3rd ed., 2003. For Spectra of Solvents, jump to the bottom of this page. For an Organic Chemistry Directory, see: http://murov.info/orgchem.htm.
For a Chemistry Directory, see: http://murov.info/webercises.htm
For much more complete information on physical and safety properties of solvents, please go to:
http://www.knovel.com/web/portal/browse/display?_EXT_KNOVEL_DISPLAY_bookid=761
http://chem.sis.nlm.nih.gov/chemidplus/chemidlite.jsp
The tables below were posted (10/23/98) and revised (07/28/09) and updated (04/10/10) by Steve Murov, Professor Emeritus of Chemistry.

Solvent, formula, boiling point (°C), melting point (°C), density (g/mL), solubility in H2O (g/100 g) [1], relative polarity [2], eluant strength [3], threshold limits (ppm) [4], vapor pressure at 20 °C (hPa)
acetic acid C2H4O2 118 16.6 1.049 M 0.648 >1 10 15.3
acetone C3H6O 56.2 -94.3 0.786 M 0.355 0.56 500 240
acetonitrile C2H3N 81.6 -46 0.786 M 0.460 0.65 20 97
acetyl acetone C5H8O2 140.4 -23 0.975 16 0.571
2-aminoethanol C2H7NO 170.9 10.5 1.018 M 0.651 3 0.53
aniline C6H7N 184.4 -6.0 1.022 3.4 0.420 2 0.4
anisole C7H8O 153.7 -37.5 0.996 0.10 0.198
benzene C6H6 80.1 5.5 0.879 0.18 0.111 0.32 0.5 101
benzonitrile C7H5N 205 -13 0.996 0.2 0.333 10 12
benzyl alcohol C7H8O 205.4 -15.3 1.042 3.5 0.608
1-butanol C4H10O 117.6 -89.5 0.81 7.7 0.586 20 6.3
2-butanol C4H10O 99.5 -114.7 0.808 18.1 0.506 100
i-butanol C4H10O 107.9 -108.2 0.803 8.5 0.552
2-butanone C4H8O 79.6 -86.3 0.805 25.6 0.327 0.51 200 105
t-butyl alcohol C4H10O 82.2 25.5 0.786 M 0.389 100 41
carbon disulfide CS2 46.3 -111.6 1.263 0.2 0.065 0.15 10 400
carbon tetrachloride CCl4 76.7 -22.4 1.594 0.08 0.052 0.18 5 120
chlorobenzene C6H5Cl 132 -45.6 1.106 0.05 0.188 0.30 10 12
chloroform CHCl3 61.2 -63.5 1.498 0.8 0.259 10 210
cyclohexane C6H12 80.7 6.6 0.779 0.005 0.006 0.04 100 104
cyclohexanol C6H12O 161.1 25.2 0.962 4.2 0.509 50 1.2
cyclohexanone C6H10O 155.6 -16.4 0.948 2.3 0.281 25 5
di-n-butylphthalate C16H22O4 340 -35 1.049 0.0011 0.272
1,1-dichloroethane C2H4Cl2 57.3 -97.0 1.176 0.5 0.269 100 240
diethylene glycol C4H10O3 245 -10 1.118 M 0.713 0.027
diglyme C6H14O3 162 -64 0.945 M 0.244
dimethoxyethane (glyme) C4H10O2 85 -58 0.868 M 0.231
N,N-dimethylaniline C8H11N 194.2 2.4 0.956 0.14 0.179
dimethylformamide (DMF) C3H7NO 153 -61 0.944 M 0.386 10 3.5
dimethylphthalate C10H10O4 283.8 1 1.190 0.43 0.309
dimethylsulfoxide (DMSO) C2H6OS 189 18.4 1.092 M 0.444 0.75
dioxane C4H8O2 101.1 11.8 1.033 M 0.164 0.56 20 41
ethanol C2H6O 78.5 -114.1 0.789 M 0.654 0.88 100 59
ether C4H10O 34.6 -116.3 0.713 7.5 0.117 0.38 400 587
ethyl acetate C4H8O2 77 -83.6 0.894 8.7 0.228 0.58 400 97
ethyl acetoacetate C6H10O3 180.4 -80 1.028 2.9 0.577
ethyl benzoate C9H10O2 213 -34.6 1.047 0.07 0.228
ethylene glycol C2H6O2 197 -13 1.115 M 0.790 1.11
glycerin C3H8O3 290 17.8 1.261 M 0.812
heptane C7H16 98 -90.6 0.684 0.0003 0.012 400 48
1-heptanol C7H16O 176.4 -35 0.819 0.17 0.549
hexane C6H14 69 -95 0.655 0.0014 0.009 0.01 50 160
1-hexanol C6H14O 158 -46.7 0.814 0.59 0.559
methanol CH4O 64.6 -98 0.791 M 0.762 0.95 200 128
methyl acetate C3H6O2 56.9 -98.1 0.933 24.4 0.253 200 220
methyl t-butyl ether (MTBE) C5H12O 55.2 -109 0.741 4.8 0.124 0.20
methylene chloride CH2Cl2 39.8 -96.7 1.326 1.32 0.309 0.42 50 475
1-octanol C8H18O 194.4 -15 0.827 0.096 0.537
pentane C5H12 36.1 -129.7 0.626 0.004 0.009 0.00 600 573
1-pentanol C5H12O 138.0 -78.2 0.814 2.2 0.568
2-pentanol C5H12O 119.0 -50 0.810 4.5 0.488
3-pentanol C5H12O 115.3 -8 0.821 5.1 0.463
2-pentanone C5H10O 102.3 -76.9 0.809 4.3 0.321
3-pentanone C5H10O 101.7 -39.8 0.814 3.4 0.265 200
1-propanol C3H8O 97 -126 0.803 M 0.617 0.82
2-propanol C3H8O 82.4 -88.5 0.785 M 0.546 0.82 400 44
pyridine C5H5N 115.5 -42 0.982 M 0.302 0.71 5 20
tetrahydrofuran (THF) C4H8O 66 -108.4 0.886 30 0.207 0.57 200 200
toluene C7H8 110.6 -93 0.867 0.05 0.099 0.29 50 29
water H2O 100.00 0.00 0.998 M 1.000 >>1
water, heavy D2O 101.3 4 1.107 M 0.991
p-xylene C8H10 138.3 13.3 0.861 0.02 0.074 0.26 100 15

1 M = miscible.
2 The values for relative polarity are normalized from measurements of solvent shifts of absorption spectra and were extracted from Christian Reichardt, Solvents and Solvent Effects in Organic Chemistry, Wiley-VCH Publishers, 3rd ed., 2003.
3 Snyder's empirical eluant strength parameter for alumina. Extracted from Reichardt, page 495.
4 Threshold limits for exposure. Extracted from Reichardt, pages 501-502.

TABLE 2


Results

Number of false positives exploded in twilight zone

In contrast to 1990, when Sander and Schneider (1991) compiled their data, now protein pairs of dissimilar structure were detected above the 30% cut-off (Figure 2A). And these were not exceptions: at a level of 32% (HSSP-curve + 7%, i.e. n = 7 in eqn 1), the number of false positives already equalled that of homologues. For the original HSSP-curve the number of false positives was 20-fold higher than the number of true pairs. The transition from 20 to 30% sequence identity was highly non-linear for true and false positives (logarithmic scales in Figure 2): the number of true pairs rose by a factor of 5, that of false pairs by a factor of 200 (Figure 2B). Thus, below the region of significant pairwise sequence identity (>34%) the population of false positives exploded. However, the vast majority of homologues also had less than 30% sequence identity.

Functional shape of original HSSP-curve adequate

The functional shape of the original HSSP-curve proved to be basically correct (Figure 3, grey line with triangles). However, the larger data set analysed here revealed several problems in detail (Figure 3B). (i) A threshold of 25% was not reasonable for an alignment length below 150–200 residues. (ii) Above an alignment length of about 100 residues, the derivative of the curve separating true and false positives should be lower than at lengths below 80. I attempted to solve these problems by defining a new curve for separating true and false positives (eqn 2; Figure 3, grey line with dotted circles). The particular functional form guaranteed an approximate saturation for long alignments. For alignments shorter than 11 residues eqn 2 yielded values above 100%. However, this was acceptable as 100% identity for fragments of 10–11 residues does not imply structural similarity (Cerpa et al., 1996; Minor and Kim, 1996; Muñoz and Serrano, 1996). The new curve saturated around 20% for alignments over more than 250 residues.
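A minimal sketch of a length-dependent identity threshold with the behaviour described above. The functional form and the constants 480, 0.32 and 1000 are not stated in this section; they are the form in which the curve is commonly quoted and should be treated as assumptions, as should the hypothetical name `identity_threshold`.

```python
import math

def identity_threshold(length: int, n: float = 0.0) -> float:
    """Percent sequence identity required at a given alignment length.

    Assumed form of eqn 2: n + 480 * L ** (-0.32 * (1 + exp(-L / 1000))).
    The value is deliberately not capped at 100%, since the text notes that
    the curve exceeds 100% for very short alignments.
    """
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return n + 480.0 * length ** exponent

if __name__ == "__main__":
    for length in (11, 12, 50, 100, 250):
        print(length, round(identity_threshold(length), 1))
```

Under these assumed constants the curve lies above 100% at 11 residues and close to 20% at 250 residues, which agrees with the description in the text.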

Defining a curve for pairwise sequence similarity

Compiling sequence identity neglects the physico-chemical nature of amino acids. Any multiple sequence alignment illustrates that, for example, the feature hydrophobicity is more conserved than is the residue type. For the million protein pairs investigated here, this was reflected in a shift of the scatter plot towards lower percentages (Figure 4). In particular, for longer alignments false positives fall below 15% pairwise sequence similarity. This prompted the introduction of a threshold specifically for sequence similarity (eqn 3 in Methods; Figure 4, grey line with dotted circles). The curve surpassed 100% for alignments shorter than 12 residues and saturated at about 10% for alignments over more than 500 residues.

Better detection of homologues in twilight zone by new curves

The new curves for length-dependent cut-offs in sequence identity (eqn 2) and similarity (eqn 3) resulted in clearly lower false positive rates (higher accuracy) than the original HSSP-curve (Figure 5B and C). This was paid for by a lower number of true positives detected (lower coverage; Figure 5A). At n = 0 (eqn 1–3), the old curve yielded about twofold more true positives, but more than 20-fold more false positives, compared to the new curves for identity and similarity. Furthermore, at any level of true positives detected, the number of false positives was smaller for the new curves (eqn 2–3) than for the original HSSP-curve (eqn 1; Figure 7). When applying a cut-off according to mere sequence identity (ignoring alignment length), accuracy dropped below 10% at levels of 30% sequence identity (Figure 5C). Thus, detection accuracy rose almost 10-fold with the new curves.

Improving detection accuracy by expert rule

Experts often apply rules of thumb to visually distinguish true and false positives. However, many such simple rules did not appear valid for automatic implementation. In particular, the distributions of the number and length of insertions did not, on average, differ between false and true positives (data not shown). Detection accuracy improved only marginally by applying the following rules: (i) compile the distance from the curve for the similarity score n_S (eqn 3) and for the identity score n_I (eqn 2), average over both ([n_S + n_I]/2), and accept pairs when this average is above some threshold n; (ii) take pairs whenever either identity or similarity surpassed the respective threshold (n_S ∨ n_I > n); (iii) take pairs if both values were above a given cut-off (n_S ∧ n_I > n). In contrast, detection accuracy increased significantly by applying the `more-similar-than-identical' rule: accept hits found in a database search only if percentage similarity is larger than percentage identity. This constraint resulted in >98% detection accuracy at n = 0 cut-off levels (eqn 2–3), while 2–4-fold fewer true positives were found at this level (Figure 5A and C). Hence, applied as a conservative cut-off in automatic database searches, this rule proved rather powerful.
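The acceptance rules listed above can be sketched as simple predicates. This is an illustration only: the function names are hypothetical, and `n_s` and `n_i` are assumed to be the distances (in percentage points) of a pair from the similarity and identity curves.

```python
def accept_average(n_s: float, n_i: float, n: float) -> bool:
    """Rule (i): the mean of both distances must exceed the threshold n."""
    return (n_s + n_i) / 2.0 > n

def accept_either(n_s: float, n_i: float, n: float) -> bool:
    """Rule (ii): similarity OR identity distance exceeds n."""
    return n_s > n or n_i > n

def accept_both(n_s: float, n_i: float, n: float) -> bool:
    """Rule (iii): similarity AND identity distance both exceed n."""
    return n_s > n and n_i > n

def more_similar_than_identical(pct_similarity: float, pct_identity: float) -> bool:
    """Expert rule: keep a hit only if percentage similarity exceeds percentage identity."""
    return pct_similarity > pct_identity
```

Keeping each rule as a separate predicate makes it straightforward to compare their accuracy on the same list of hits.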

Improving detection accuracy by sequence-space-hopping

Hopping in sequence space proved successful in discarding false positives. Already the minimal constraint to accept a pair if at least one protein was common between the two sequence families yielded levels of around 80% accuracy even down to cut-off levels corresponding to 20% sequence identity (Figure 6A; compared with <20% accuracy for the normal thresholds, Figure 5C). Accuracy increased further when more proteins were required to be common to both families (Figure 6A). However, sequence space hopping was possible for only relatively few protein pairs (Figure 6B). Furthermore, the improvement in accuracy was less clear using sequence-space-hopping than by applying the `more-similar-than-identical' rule (Figure 5).
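The hopping constraint amounts to a set intersection between two sequence families. A minimal sketch, assuming each family is available as a set of protein identifiers (the identifiers and the name `passes_hopping` are hypothetical):

```python
def passes_hopping(family_a: set, family_b: set, min_common: int = 1) -> bool:
    """Accept a pair only if its two sequence families share enough members."""
    return len(family_a & family_b) >= min_common

if __name__ == "__main__":
    fam_a = {"prot_1", "prot_2", "prot_3"}
    fam_b = {"prot_3", "prot_4"}
    print(passes_hopping(fam_a, fam_b))                 # one shared member
    print(passes_hopping(fam_a, fam_b, min_common=2))   # stricter requirement
```

Raising `min_common` corresponds to requiring more proteins in common, which the text reports as increasing accuracy at the cost of applying to fewer pairs.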

Accuracy versus coverage for BLAST and full dynamic programming

The balance between accuracy (percentage of true pairs) and coverage (percentage of all true pairs) enables choosing automatic thresholds according to a particular purpose of a database search. It also permits comparing different methods (the higher the values, the better). (i) As expected, the commonly used simple level of sequence identity (disregarding alignment length) proved, again, an extremely bad choice. (ii) Surprisingly, the fast database searching method BLAST performed relatively well in comparison to the full dynamic programming (Figure 7A). (iii) Both BLASTP version 2 and PSI-BLAST were almost as good as the full dynamic programming with the previously defined HSSP-threshold (Sander and Schneider, 1991). (iv) Best performance was achieved by the new threshold for similarity (eqn 3). (v) However, the raw alignment score performed almost as well. (vi) BLASTP (Altschul et al., 1990) performed rather similarly to the more elaborate and more recent PSI-BLAST (Altschul et al., 1997) (and for `high' accuracy even slightly better; Figure 7A, inset; note that, given that standard parameters were chosen, this was not surprising). The corresponding thresholds were given in Figure 5B for the dynamic programming, and in Figure 7B for the PSI-BLAST probabilities.
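A short sketch of how accuracy and coverage can be computed for one threshold, assuming the hits above the threshold and the full set of structurally similar pairs are both available as sets of pair identifiers (the representation and the function name are illustrative assumptions):

```python
def accuracy_and_coverage(hits: set, true_pairs: set) -> tuple:
    """Return (accuracy %, coverage %) for one cut-off.

    accuracy: share of reported hits that are true pairs.
    coverage: share of all true pairs that were reported.
    """
    true_positives = len(hits & true_pairs)
    accuracy = 100.0 * true_positives / len(hits) if hits else 0.0
    coverage = 100.0 * true_positives / len(true_pairs) if true_pairs else 0.0
    return accuracy, coverage

if __name__ == "__main__":
    hits = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")}
    true_pairs = {("a", "b"), ("a", "c"), ("d", "e"), ("b", "e"), ("c", "d"), ("a", "e")}
    print(accuracy_and_coverage(hits, true_pairs))
```

Evaluating this pair of numbers over a range of cut-offs gives the accuracy-versus-coverage curve used to compare methods.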

Many false negatives at reasonable cut-off values

The number of false negatives is often of interest, i.e. the number of proteins that belong to a structure family but were not detected above a given cut-off. For the data sets used here, the cumulative percentage of false negatives was extremely high for all reasonable cut-off levels (Figure 5D). The vast majority of all pairs of proteins with similar structure populate the midnight zone below 10% sequence identity (Rost, 1997). Thus, the extremely high false negative rates proved that methods aligning two proteins merely based on the pairwise levels of sequence homology clearly fail to find the gold mine of database searches (and that older analyses that failed to describe this effect were based on biased data sets).

Thresholds for practical use

For simplicity the functions (eqn 1–3) were explicitly provided in tables (Rost, 1998). At levels of n = 0 (eqn 1–3) the cumulative numbers of true positives were (Figure 5): HSSP-curve (eqn 1), 12%; new identity curve (eqn 2), 56%; new similarity curve (eqn 3), 73%. To achieve levels of 99% correct hits, m percentage points have to be added to the curves, where m was: HSSP-curve, m = 8; new identity curve, m = 5; new similarity curve, m = 12. For comparison, applying the `more-similar-than-identical' rule yielded levels above 99% down to m = –1.


Footnotes

This article has been edited by the Royal Society of Chemistry, including the commissioning, peer review process and editorial aspects up to the point of acceptance.

Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.



Materials and methods

Non-redundant set of protein structures

The redundancy in the PDB database (June 2005) was filtered to a representative list such that the MAMMOTH alignment [27] of any two chains in the list fails at least one of the following four cut-offs: a minimum of 90% sequence identity; a minimum of 90% of Cα atoms aligned within 4 Å; a maximum of 1 Å Cα root mean square deviation; and a maximum of a 50 residue difference in length. Each non-redundant chain represents all other PDB chains in the initial list that pass the cut-offs listed above for all pairwise comparisons within the group; where possible, the representative was picked by maximizing its resolution. Additionally, obsolete PDB entries as well as entries with missing atoms were removed from the initial set, resulting in a final list of 22,732 protein chains. To assess the impact of the PDB redundancy on the accuracy of the EvPs in model assessment, the final representative set of chains was further clustered by varying the sequence identity and structure similarity cut-offs (Table S1 in Additional data file 1).
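The four cut-offs can be read as a single redundancy test on a pairwise structural alignment. The sketch below assumes the alignment statistics are already available (for example parsed from alignment output); the `PairAlignment` container and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PairAlignment:
    seq_identity: float    # percent sequence identity
    ca_aligned_4a: float   # percent of C-alpha atoms aligned within 4 Å
    ca_rmsd: float         # C-alpha RMSD in Å
    length_diff: int       # difference in chain length, in residues

def is_redundant(aln: PairAlignment) -> bool:
    """True only if the pair passes all four cut-offs.

    Two chains can both stay in the representative list only when this
    returns False, i.e. when they fail at least one cut-off.
    """
    return (
        aln.seq_identity >= 90.0
        and aln.ca_aligned_4a >= 90.0
        and aln.ca_rmsd <= 1.0
        and aln.length_diff <= 50
    )
```

Picking the representative of each redundant group (by resolution, as the text describes) would be a separate step on top of this predicate.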

Multiple sequence alignments

A MSA for each of the 22,732 non-redundant PDB chains was built using PSI-BLAST (version 2.2.10) [28] to search against the NCBI nr database (June 2005). The search was performed without filtering out compositionally biased segments, running for up to 5 iterations, and including up to 100,000 sequence hits with an e-value smaller than 5 × 10^-4. All other PSI-BLAST parameters were set to their default values. The MSAs were further filtered by removing those sequences that aligned with less than 20%, 40% or 60% sequence identity to the query protein. Finally, all filtered MSAs with 50 or more sequences were used for deriving EvPs (Table S1 in Additional data file 1).
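The post-search filtering step can be sketched as follows, assuming each hit has already been reduced to an identifier and its percent identity to the query (this data layout and the name `filter_msa` are illustrative assumptions, not the authors' code):

```python
def filter_msa(hits, min_identity: float, min_sequences: int = 50):
    """Drop hits below the identity cut-off; keep the MSA only if it is large enough.

    hits: iterable of (sequence_id, percent_identity_to_query) tuples.
    Returns the filtered list, or None if fewer than min_sequences remain.
    """
    kept = [(seq_id, ident) for seq_id, ident in hits if ident >= min_identity]
    return kept if len(kept) >= min_sequences else None

if __name__ == "__main__":
    hits = [(f"hit_{i}", 45.0) for i in range(60)] + [(f"weak_{i}", 15.0) for i in range(20)]
    for cutoff in (20.0, 40.0, 60.0):
        result = filter_msa(hits, cutoff)
        print(cutoff, None if result is None else len(result))
```

Running the same hits through the 20%, 40% and 60% cut-offs yields the three MSA variants referred to in the text.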

Sequence weighting

A position-based sequence weighting that assigns low weights to over-represented sequences and high weights to unique sequences was used to compensate for non-uniform distribution of the homologous protein sequences in a MSA [29]. The sequence weights W_j were calculated as:

where r_i is the number of different residue types at position i, and n_i,j is the frequency of occurrence of the residue type in position i and sequence j with respect to all residues in position i.
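The weighting equation itself is not reproduced in this text. A widely used position-based scheme that matches the definitions of r_i and n_i,j sums 1/(r_i · n_i,j) over all alignment positions; the sketch below implements that scheme as an assumption. Treating n_i,j as a count, treating gap characters as an ordinary symbol, and normalizing the weights to sum to 1 are all choices made here for illustration.

```python
from collections import Counter

def position_based_weights(msa):
    """Assumed position-based weights for equal-length aligned sequences.

    Each column contributes 1 / (r_i * n_ij) to sequence j, where r_i is the
    number of distinct symbols in column i and n_ij counts how many sequences
    share sequence j's symbol in that column.
    """
    n_seqs, length = len(msa), len(msa[0])
    weights = [0.0] * n_seqs
    for i in range(length):
        column = [seq[i] for seq in msa]
        counts = Counter(column)
        r_i = len(counts)
        for j, residue in enumerate(column):
            weights[j] += 1.0 / (r_i * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    print(position_based_weights(["AAC", "AAC", "AGD"]))
```

In the small example the two identical sequences share a lower weight and the unique third sequence receives a higher one, which is the behaviour the text describes.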

Derivation of knowledge-based potentials

Two different types of knowledge-based potentials were derived in this work: a representative distance-dependent potential (REP), used as a baseline to benchmark the impact of our new approach, and a series of structure-specific distance-dependent potentials here termed EvPs. The only difference between the REP and the EvP potentials was the input structural space selected for their derivation as well as the use of sequence information. On the one hand, the REP potential was calculated from a set of 22,732 non-redundant protein structures (Figure 4a) following the approach commonly used to derive distance-dependent potentials [7, 19, 30–35]. On the other hand, for 20,008 of the 22,732 non-redundant protein structures (that is, structures with more than 50 homologous sequences in their MSA), an EvP was calculated using the sequence variability in a set of homologous sequences to the selected structure (Figure 4b). Each EvP was derived by virtually threading all homologous sequences in the MSA into the selected structure, which was used as a guide for the replacement of the amino-acid type at each position. Thus, one can say that the 20,008 EvPs encode the sequence variation observed in the MSA for each of the non-redundant structures. Briefly, the threading approach implemented for deriving EvPs followed three steps: first, collect all pairwise alignments between the selected structure and its homologous sequences in the MSA; second, using each pairwise alignment as a guide, replace the amino-acid type in the selected structure by the one in the homologous sequence; and third, for a gapped position keep the original residue in the selected structure. Two variations of this protocol were also tested, which included the removal of residues in the structure aligned to a gap and the renumbering of the template residues (that is, affecting the sequence separation value of the statistical potential). The tested protocols showed no statistical differences between the resulting EvPs (Table S6 in Additional data file 1). The counting of residue-residue interactions for deriving an EvP was proportional to the sequence weight that accounts for redundancy within the MSA.
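The three threading steps can be illustrated on a single pairwise alignment. This is a sketch under the assumption that both aligned strings have equal length and use "-" for gaps; the function name is hypothetical.

```python
def thread_sequence(aligned_template: str, aligned_homolog: str, gap: str = "-") -> str:
    """Return the residue types to place on the template's positions.

    For each template position, take the homolog's residue; if the homolog
    has a gap there, keep the template's own residue. Homolog insertions
    (gap in the template) have no structural position and are skipped.
    """
    threaded = []
    for t, h in zip(aligned_template, aligned_homolog):
        if t == gap:
            continue
        threaded.append(t if h == gap else h)
    return "".join(threaded)

if __name__ == "__main__":
    print(thread_sequence("AC-DE", "GCW-E"))
```

The result always has the same number of residues as the template structure, so the template's coordinates can be reused unchanged when counting residue-residue distances.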

EvP and REP derivation protocols. (a) The REP potential was built in a three-step process to: step 1, generate a non-redundant set of protein structures from the PDB database; step 2, calculate all residue-residue distance frequencies within each of the representative chains from step 1; and step 3, derive a knowledge-based potential using the inverse Boltzmann law to transform the raw frequencies into pseudo-energy terms. (b) The EvPs were built in a six-step process to: step 1, generate a non-redundant set of protein structures from the PDB database; step 2, select each of the representative chains as query structures; step 3, calculate a MSA using the PSI-BLAST program; step 4, thread all homologous sequences into the query structure using the sequence-based alignment from the previous step; step 5, calculate all residue-residue distance frequencies; and step 6, derive a knowledge-based potential using the inverse Boltzmann law to transform the raw frequencies into pseudo-energy terms.

In contrast to the REP, where the non-redundant set of protein structures constituted its training set, there was not a single and unique training set for deriving an EvP. The training sets used in EvPs were the actual multiple sequence alignments specific for each selected structure.

In addition to the REP and the EvPs, a single consensus potential (CON) was derived using the sum of observed interaction frequencies from each of the 20,008 individual EvPs. Thus, the CON potential encodes the structural space encompassed by the non-redundant set of structures as well as the sequence space occupied by their homologous sequences.

All potentials derived in this work were calculated using our previously optimized parameters for model assessment [7]. Briefly, the potentials used Cα and Cβ atoms as interaction centers, distinguished between all 20 standard residue types, had a maximal distance range of 15 Å distributed in 30 bins of 0.5 Å each, and accounted for the sequence separation of the interacting atom pairs. Local interactions were considered independently using sequence separations of 2, 3, 4, 5, 6, 7 and 8 residues and non-local interactions were considered by grouping into a single term the interactions with sequence separations larger than or equal to 9 residues.
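The distance and sequence-separation binning described above can be sketched as two small helpers. Whether bin edges are open or closed, and the decision to ignore separations below 2, are assumptions made here; the names are hypothetical.

```python
def distance_bin(distance: float, max_distance: float = 15.0, bin_width: float = 0.5):
    """Index (0..29) of the 0.5 Å bin holding a distance, or None if out of range."""
    if distance < 0.0 or distance >= max_distance:
        return None
    return int(distance // bin_width)

def separation_class(i: int, j: int):
    """Sequence-separation class for residues i and j.

    Separations 2..8 are kept as individual local classes; separations of 9
    or more are grouped into one non-local class (returned as 9). Separations
    below 2 are not listed in the text and return None here.
    """
    separation = abs(i - j)
    if separation < 2:
        return None
    return min(separation, 9)
```

With 30 distance bins and 8 separation classes, each residue-type pair is described by at most 240 frequency counts per atom-pair type.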

Z-scores

Energy Z-scores were calculated based on the protein model energy, the mean and the standard deviation of the knowledge-based potential energy of 1,000 random sequences with the same amino acid composition and structure of the protein model, as previously described [7].
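A minimal sketch of the Z-score calculation, assuming the energies of the random sequences have already been computed with the same potential. Using the population standard deviation is an assumption; the text does not say which estimator was used.

```python
import statistics

def energy_z_score(model_energy: float, random_energies) -> float:
    """Z-score of a model's energy against energies of shuffled sequences."""
    mean = statistics.mean(random_energies)
    std = statistics.pstdev(random_energies)  # population standard deviation (assumed)
    return (model_energy - mean) / std

if __name__ == "__main__":
    background = [0.0, 2.0, -2.0, 1.0, -1.0]
    print(energy_z_score(-10.0, background))
```

A strongly negative Z-score means the model's energy is far below what random sequences of the same composition achieve on the same structure.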

Model assessment protocol

An EvP was calculated for each of the non-redundant chains in the PDB and represented a given set of similar structures. Thus, the selection of an EvP for assessing the accuracy of a given model could have an impact on the final accuracy of our method. Several protocols were implemented and tested to assess such an impact.

Template-based selection

The template structure used to build the model was obtained from the corresponding sequence-structure alignment used during the modeling. Then, the EvP representing the template's structural cluster was used to evaluate the accuracy of the model.

Template-free selection

In order to assess the impact of the EvP selection for template-free models, the PSI-BLAST and BLAST algorithms were used with default values to detect the closest match between the sequence of the model and our database of EvPs.

Random selection

The so-called random potential (RND) was calculated by randomly selecting one of the 20,008 EvPs to assess the accuracy of a given model.

To avoid biased results, the EvP derived for the target structure was removed prior to EvP selection in all three protocols. However, it is important to note that it is not certain, even conceptually, that rigorous testing of a method should not rely on structures similar or identical to those from which the potentials were derived. In practice, statistical potentials are to be used in model assessment of comparative models that, by construction, are similar to known protein structures. Therefore, all of the known protein structures are legitimate sources for deriving any of the statistical potentials used in practical model assessment, including those known structures that happen to be related to the assessed model.

Test set of comparative models

The evaluation of the EvPs for model assessment was based on an initial set of 9,645 structural models divided into 3,375 correct and 6,270 incorrect models [7, 22]. A correct model was defined as a model for which at least 30% of the Cα atoms superimposed within 3.5 Å with those of the real structure, and thus is based on proper fold assignment and a relatively accurate sequence/structure alignment. Incorrect models (that is, superimposing less than 15% of the Cα atoms within 3.5 Å) were built using a wrong fold or based on the correct fold, but containing a large fraction of misalignments. Thus, the test set of protein structure models, which was the result of a large-scale comparative modeling of the complete PDB [22], represented the known protein structural space. This set of comparative models has been previously and extensively used to benchmark methods of model assessment [7, 17, 22, 36, 37].
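The two definitions can be written as a small classifier on the percentage of superimposed Cα atoms. The label for models falling between the two thresholds is an assumption; the text does not say how such models were treated.

```python
def classify_model(percent_ca_within_3_5a: float) -> str:
    """Label a model from the share of C-alpha atoms within 3.5 Å of the real structure."""
    if percent_ca_within_3_5a >= 30.0:
        return "correct"
    if percent_ca_within_3_5a < 15.0:
        return "incorrect"
    return "unassigned"  # between the two cut-offs; handling assumed

if __name__ == "__main__":
    for value in (72.0, 30.0, 22.0, 9.5):
        print(value, classify_model(value))
```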

To be able to fairly compare all potentials, the initial test set was reduced to 1,877 correct and 2,567 incorrect models, which corresponded to those for which an EvP could be derived for all clustering cut-offs (Table S1 in Additional data file 1). Since an EvP cannot be reliably derived for representative structures with fewer than 50 homologous sequences [7], a large fraction of models did not have a derived EvP for their corresponding template structures in the CLS-90-90_MSA-60 cluster. However, an EvP at CLS-90-90 and MSA-20, which corresponds to the most accurate knowledge-based potential (Results), could be calculated for 96.4% (3,253) and 94.8% (5,942) of correct and incorrect models in the test set, respectively.

All potential scores, the models for the two datasets used in this work as well as the EvPs are available for download at [38].

Benchmarking criteria

The accuracy of the knowledge-based potentials was evaluated by means of the maximal accuracy (ACC) and the AUC, which were calculated from a receiver operating characteristic (ROC) curve [39] using correct models as positive instances and incorrect models as negative instances. A ROC curve is obtained by plotting the FPR (that is, fraction of incorrect models assessed as correct) against the corresponding TPR (that is, fraction of correct models assessed as correct) for all possible cut-offs on the energy Z-score. The AUC is considered a robust indicator of classifier quality given its independence from the selected threshold and its correlation with the probability of the classifier error [39]. The optimal classification threshold leading to the maximal ACC is also reported for each tested potential.
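The ROC construction and the two summary measures can be sketched in plain Python. The sketch assumes that a model is assessed as correct when its Z-score is at or below the cut-off (lower energy is better); that direction, and all function names, are assumptions.

```python
def roc_points(z_correct, z_incorrect):
    """(FPR, TPR) pairs for every observed Z-score used as a cut-off."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(z_correct) | set(z_incorrect)):
        tpr = sum(z <= cutoff for z in z_correct) / len(z_correct)
        fpr = sum(z <= cutoff for z in z_incorrect) / len(z_incorrect)
        points.append((fpr, tpr))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def max_accuracy(z_correct, z_incorrect):
    """Best fraction of models classified correctly, with the cut-off achieving it."""
    total = len(z_correct) + len(z_incorrect)
    best_acc, best_cutoff = 0.0, None
    for cutoff in sorted(set(z_correct) | set(z_incorrect)):
        right = sum(z <= cutoff for z in z_correct) + sum(z > cutoff for z in z_incorrect)
        if right / total > best_acc:
            best_acc, best_cutoff = right / total, cutoff
    return best_acc, best_cutoff

if __name__ == "__main__":
    good = [-5.0, -4.0, -3.0, -1.0]
    bad = [-2.0, 0.0, 1.0, 2.0]
    print(auc_trapezoid(roc_points(good, bad)), max_accuracy(good, bad))
```

The quadratic loops keep the sketch readable; for thousands of models a sorted single pass would be the more practical design.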

Other benchmarked methods

Two widely used knowledge-based potentials for error detection in protein structure models were also evaluated to provide an additional and objective reference frame for evaluating the accuracy of the EvPs. First, the Prosa II program [4, 20, 21], derived from a set of non-redundant structures, calculates an energy score and a Z-score for an input model. Second, the DFIRE program [19], derived by using a distance-scaled finite ideal-gas as reference state, calculates an energy score for a model. The final DFIRE Z-scores were calculated using the procedure described above. Both programs, Prosa II and DFIRE, were locally run using their respective default parameters.

Statistical significance of the differences between the evaluated potentials

The statistical significance of the observed differences between two potentials used as binary classifiers was evaluated by a non-parametric test that accounts for the correlation of the ROC curves [40]. This test takes advantage of the equality between the Mann-Whitney U-statistic and the AUC when computed by the trapezoidal rule for comparing two distributions. A chi-square statistic computes the significance (p-value) of the difference between the AUC measured for the two classifiers. The results corresponding to the statistical comparisons are reported in Additional data file 1 (Tables S1 and S3–S5).
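The equality the test relies on can be illustrated by computing the AUC directly as a normalized Mann-Whitney U-statistic. This sketch covers only that equality, not the significance test itself, and assumes that lower Z-scores indicate correct models.

```python
def mann_whitney_auc(z_correct, z_incorrect) -> float:
    """AUC as the fraction of (correct, incorrect) pairs ranked in the right order.

    A pair counts 1 when the correct model has the lower Z-score and 0.5 on
    a tie; dividing by the number of pairs gives U / (n_correct * n_incorrect).
    """
    wins = 0.0
    for zc in z_correct:
        for zi in z_incorrect:
            if zc < zi:
                wins += 1.0
            elif zc == zi:
                wins += 0.5
    return wins / (len(z_correct) * len(z_incorrect))

if __name__ == "__main__":
    print(mann_whitney_auc([-5.0, -4.0, -3.0, -1.0], [-2.0, 0.0, 1.0, 2.0]))
```

Read this way, the AUC is the probability that a randomly chosen correct model scores better than a randomly chosen incorrect one.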

