We know that software tools such as I-TASSER offer both a web server and a standalone option. Is there any difference in protein C-scores, efficiency, or accuracy if you run the standalone version instead of the web server?
What would be the difference if I use the Web server instead of dedicated server for protein modeling and docking?. Available from: https://www.researchgate.net/post/What_would_be_the_difference_if_I_use_the_Web_server_instead_of_dedicated_server_for_protein_modeling_and_docking [accessed Apr 27, 2017].
There should be no difference in output.*** The big difference is likely going to be in the size of the analysis you can perform, and how fast the analysis is performed. Many web servers have a job size limit to prevent one person from hogging resources unless you buy greater access. On a public web server, your job is queued with everyone else's. If the server is in high demand, you may have to wait a long time for your analysis to start.
***: I said there should be no difference in output and that is true if you set up your server the same as the public webserver with the same parameters used for analysis execution. This could be a good or bad thing depending on what you are trying to accomplish. Some public webservices do not give you access to all parameters, so setting up your own may give you the ability to tune your analysis to a greater degree.
Chimera or other software to assign protonation states of a protein
I want to perform a Molecular Docking between some ligands and a protein in different pH conditions.
For this, I calculated the distances between atoms of the ligands at different pH values, using the software Avogadro.
I would like to do the same with my protein, that is, get the structure and its protonation state at two different pH values.
Is there any software that allows me to do that?
In order to follow this tutorial you only need a web browser, a text editor, and PyMOL (freely available for most operating systems) on your computer in order to visualize the input and output data.
Further, the required data to run this tutorial are the same as for the DisVis tutorial and should be downloaded from here. Once downloaded, make sure to unpack the archive.
Also, if you have not been provided with special workshop credentials to use the HADDOCK portal, make sure to register in order to be able to submit jobs. Use the following registration page: https://bianca.science.uu.nl/auth/register/haddock.
HADDOCK general concepts
HADDOCK (see https://www.bonvinlab.org/software/haddock2.2) is a collection of python scripts derived from ARIA (https://aria.pasteur.fr) that harness the power of CNS (Crystallography and NMR System, https://cns-online.org) for structure calculation of molecular complexes. What distinguishes HADDOCK from other docking software is its ability, inherited from CNS, to incorporate experimental data as restraints and use these to guide the docking process alongside traditional energetics and shape complementarity. Moreover, the intimate coupling with CNS endows HADDOCK with the ability to actually produce models of sufficient quality to be archived in the Protein Data Bank.
A central aspect of HADDOCK is the definition of Ambiguous Interaction Restraints or AIRs. These allow the translation of raw data such as NMR chemical shift perturbation or mutagenesis experiments into distance restraints that are incorporated in the energy function used in the calculations. AIRs are defined through a list of residues that fall under two categories: active and passive. Generally, active residues are those of central importance for the interaction, such as residues whose knockouts abolish the interaction or those where the chemical shift perturbation is higher. Throughout the simulation, these active residues are restrained to be part of the interface, if possible, otherwise incurring a scoring penalty. Passive residues are those that contribute to the interaction, but are deemed of less importance. If such a residue does not belong in the interface there is no scoring penalty. Hence, a careful selection of which residues are active and which are passive is critical for the success of the docking.
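The active/passive distinction can be illustrated with a small bookkeeping sketch. This is not HADDOCK's restraint energy term; it only assumes that a model's interface is available as a set of residue numbers, and the function name `air_report` and the residue numbers are hypothetical.

```python
def air_report(active, passive, interface):
    """Compare restrained residues with the interface residues of one model.

    Illustrative bookkeeping only: active residues missing from the
    interface are the ones that would incur a penalty, while passive
    residues missing from the interface would not.
    """
    active, passive, interface = set(active), set(passive), set(interface)
    return {
        "satisfied_active": sorted(active & interface),
        "violated_active": sorted(active - interface),   # would be penalized
        "passive_in_interface": sorted(passive & interface),
        "passive_outside": sorted(passive - interface),  # no penalty
    }


# Hypothetical residue numbers for one docking partner
report = air_report(active=[10, 11, 45], passive=[12, 44], interface=[10, 11, 12, 50])
print(report["violated_active"])
print(report["passive_outside"])
```

In this toy case residue 45 is active but absent from the interface, so it is the only one flagged as a violation.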
The docking protocol of HADDOCK was designed so that the molecules experience varying degrees of flexibility and different chemical environments, and it can be divided into three different stages, each with a defined goal and characteristics:
1. Randomization of orientations and rigid-body minimization (it0)
In this initial stage, the interacting partners are treated as rigid bodies, meaning that all geometrical parameters such as bond lengths, bond angles, and dihedral angles are frozen. The partners are separated in space and rotated randomly about their centers of mass. This is followed by a rigid body energy minimization step, where the partners are allowed to rotate and translate to optimize the interaction. The role of AIRs in this stage is of particular importance. Since they are included in the energy function being minimized, the resulting complexes will be biased towards them. For example, defining a very strict set of AIRs leads to a very narrow sampling of the conformational space, meaning that the generated poses will be very similar. Conversely, very sparse restraints (e.g. the entire surface of a partner) will result in very different solutions, displaying greater variability in the region of binding.
2. Semi-flexible simulated annealing in torsion angle space (it1)
The second stage of the docking protocol introduces flexibility to the interacting partners through a three-step molecular dynamics-based refinement in order to optimize interface packing. It is worth noting that flexibility in torsion angle space means that bond lengths and angles are still frozen. The interacting partners are first kept rigid and only their orientations are optimized. Flexibility is then introduced in the interface, which is automatically defined based on an analysis of intermolecular contacts within a 5 Å cut-off. This allows different binding poses coming from it0 to have different flexible regions defined. Residues belonging to this interface region are then allowed to move their side-chains in a second refinement step. Finally, both backbone and side-chains of the flexible interface are granted freedom. The AIRs again play an important role at this stage since they might drive conformational changes.
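A contact-based interface definition of this kind can be sketched in a few lines. The snippet below is only an illustration of the idea, not HADDOCK's implementation: it assumes atoms are given as `(residue_id, (x, y, z))` tuples, and the function name and coordinates are made up.

```python
import math


def interface_residues(atoms_a, atoms_b, cutoff=5.0):
    """Return the residues of each partner with any atom within `cutoff` Å
    of any atom of the other partner.

    atoms_a, atoms_b: lists of (residue_id, (x, y, z)) tuples.
    """
    iface_a, iface_b = set(), set()
    for res_a, xyz_a in atoms_a:
        for res_b, xyz_b in atoms_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                iface_a.add(res_a)
                iface_b.add(res_b)
    return iface_a, iface_b


# Toy coordinates (Å), one atom per residue for brevity
partner_a = [(1, (0.0, 0.0, 0.0)), (2, (20.0, 0.0, 0.0))]
partner_b = [(101, (3.0, 0.0, 0.0)), (102, (40.0, 0.0, 0.0))]
iface_a, iface_b = interface_residues(partner_a, partner_b)
print(iface_a, iface_b)
```

Because the interface is recomputed from the contacts of each pose, two poses coming out of it0 can end up with different flexible regions. The all-pairs loop is quadratic; a real implementation would use a spatial grid or neighbor list.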
3. Refinement in Cartesian space with explicit solvent (water)
The final stage of the docking protocol immerses the complex in a solvent shell to improve the energetics of the interaction. HADDOCK currently supports water (TIP3P model) and DMSO environments. The latter can be used as a membrane mimic. In this short explicit solvent refinement the models are subjected to a short molecular dynamics simulation at 300 K, with position restraints on the non-interface heavy atoms. These restraints are later relaxed to allow all side chains to be optimized. In the 2.4 version of HADDOCK, the explicit solvent refinement is replaced by default by a simple energy minimisation, as benchmarking has shown that it does not add much to the quality of the models. This saves time.
The performance of this protocol depends on the number of models generated at each step. Too few models are unlikely to capture the correct binding pose, while too many become computationally unreasonable. The standard HADDOCK protocol generates 1000 models in the rigid body minimization stage, and then refines the best 200 (ranked based on the HADDOCK score) in both it1 and water. Note, however, that while 1000 models are generated by default in it0, they are the result of five minimization trials and for each of these the 180 degrees symmetrical solution is also sampled. Effectively, the 1000 models written to disk are thus the result of sampling 10,000 docking solutions.
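The sampling arithmetic can be written out explicitly. The variable names are illustrative; the numbers are the defaults quoted in the text.

```python
models_written = 1000        # it0 models written to disk (default)
trials_per_model = 5         # rigid-body minimization trials per model
orientations_per_trial = 2   # each trial plus its 180-degree rotated solution

poses_sampled = models_written * trials_per_model * orientations_per_trial
refined_fraction = 200 / models_written  # share of it0 models passed on to it1

print(poses_sampled)      # 10000
print(refined_fraction)   # 0.2
```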
The final models are automatically clustered based on a specific similarity measure - either the positional interface ligand RMSD (iL-RMSD) that captures conformational changes about the interface by fitting on the interface of the receptor (the first molecule) and calculating the RMSDs on the interface of the smaller partner, or the fraction of common contacts (current default) that measures the similarity of the intermolecular contacts. For RMSD clustering, the interface used in the calculation is automatically defined based on an analysis of all contacts made in all models.
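As a rough illustration of the contact-based similarity measure, the sketch below computes the fraction of one model's intermolecular contacts that also occur in another model. It assumes contacts are already available as sets of residue pairs; the function name and example contacts are hypothetical, and the actual clustering procedure in HADDOCK involves more than this single measure.

```python
def fraction_common_contacts(contacts_a, contacts_b):
    """Fraction of the contacts of model A that are also present in model B.

    Each contact is a (residue_in_partner_1, residue_in_partner_2) pair.
    """
    contacts_a, contacts_b = set(contacts_a), set(contacts_b)
    if not contacts_a:
        return 0.0
    return len(contacts_a & contacts_b) / len(contacts_a)


# Hypothetical intermolecular contacts for two docked models
model_1 = {(10, 101), (11, 101), (12, 105), (45, 110)}
model_2 = {(10, 101), (11, 101), (12, 106)}

fcc_12 = fraction_common_contacts(model_1, model_2)
fcc_21 = fraction_common_contacts(model_2, model_1)
print(fcc_12, fcc_21)
```

Defined this way the measure is asymmetric (here 2/4 versus 2/3), which a clustering procedure has to take into account, for instance by requiring both directions to exceed a cutoff.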
The new 2.4 version of HADDOCK also allows coarse-graining of the system, which effectively reduces the number of particles and speeds up the computations. For this we use the MARTINI2.2 force field, which is based on a four-to-one mapping of atoms onto coarse-grained beads.
AutoDock is a suite of automated docking tools. It is designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.
Current distributions of AutoDock consist of two generations of software: AutoDock 4 and AutoDock Vina.
AutoDock 4 actually consists of two main programs: autodock performs the docking of the ligand to a set of grids describing the target protein; autogrid pre-calculates these grids.
In addition to using them for docking, the atomic affinity grids can be visualised. This can help, for example, to guide organic synthetic chemists to design better binders.
AutoDock Vina does not require choosing atom types and pre-calculating grid maps for them. Instead, it calculates the grids internally, for the atom types that are needed, and it does this virtually instantly.
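Because Vina builds its grids internally, a run needs only the receptor, the ligand, and a search box. The sketch below assembles such a command line without executing it. The file names and box values are placeholders, and it assumes a `vina` executable on the PATH that accepts the usual `--receptor`, `--ligand`, `--center_*`, `--size_*` and `--out` options.

```python
def vina_command(receptor, ligand, center, size, out):
    """Build (but do not run) an AutoDock Vina command line as a list."""
    cmd = ["vina", "--receptor", receptor, "--ligand", ligand]
    # The search box is given by its center and edge lengths in Å
    for axis, c, s in zip("xyz", center, size):
        cmd += [f"--center_{axis}", str(c), f"--size_{axis}", str(s)]
    cmd += ["--out", out]
    return cmd


# Placeholder inputs; PDBQT files must be prepared beforehand
cmd = vina_command(
    receptor="receptor.pdbqt",
    ligand="ligand.pdbqt",
    center=(10.0, 12.5, -3.0),
    size=(20, 20, 20),
    out="docked.pdbqt",
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the input files exist. Note that no grid-map files appear anywhere in the command.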
We have also developed a graphical user interface called AutoDockTools, or ADT for short, which amongst other things helps to set up which bonds will be treated as rotatable in the ligand and to analyze dockings.
- X-ray crystallography
- structure-based drug design
- lead optimization
- virtual screening (HTS)
- combinatorial library design
- protein-protein docking
- chemical mechanism studies.
AutoDock 4 is free and is available under the GNU General Public License. AutoDock Vina is available under the Apache license, allowing commercial and non-commercial use and redistribution. Click on the "Downloads" tab. And Happy Docking!
What is AutoDock Vina?
AutoDock Vina is a new generation of docking software from the Molecular Graphics Lab. It achieves significant improvements in the average accuracy of the binding mode predictions, while also being up to two orders of magnitude faster than AutoDock 4. 1
Because the scoring functions used by AutoDock 4 and AutoDock Vina are different and inexact, on any given problem, either program may provide a better result.
Detailed information can be found on the AutoDock Vina web site.
AutoDock 4.2 is faster than earlier versions, and it allows sidechains in the macromolecule to be flexible. As before, rigid docking is blindingly fast, and high-quality flexible docking can be done in around a minute. Up to 40,000 rigid dockings can be done in a day on one cpu.
AutoDock 4.2 now has a free-energy scoring function that is based on a linear regression analysis, the AMBER force field, and an even larger set of diverse protein-ligand complexes with known inhibition constants than we used in AutoDock 3.0. The best model was cross-validated with a separate set of HIV-1 protease complexes, and confirmed that the standard error is around 2.5 kcal/mol. This is enough to discriminate between leads with milli-, micro- and nano-molar inhibition constants.
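To see why a standard error of about 2.5 kcal/mol is enough to separate these classes, one can convert inhibition constants to binding free energies with the standard relation ΔG = RT ln(Ki). The temperature of 298.15 K is an assumption made here for illustration and is not stated in the text.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)
T = 298.15            # assumed temperature in K


def binding_free_energy(ki_molar, temperature=T):
    """Binding free energy in kcal/mol from an inhibition constant in mol/L."""
    return R_KCAL * temperature * math.log(ki_molar)


for label, ki in [("millimolar", 1e-3), ("micromolar", 1e-6), ("nanomolar", 1e-9)]:
    print(f"{label}: {binding_free_energy(ki):.2f} kcal/mol")

# Free-energy gap between neighboring classes (a factor of 1000 in Ki)
gap = binding_free_energy(1e-3) - binding_free_energy(1e-6)
```

Each factor of 1000 in Ki corresponds to roughly 4.1 kcal/mol at this temperature, which is larger than the quoted 2.5 kcal/mol standard error.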
You can read more about the new features in AutoDock 4.2 and how to use them in the AutoDock4.2 User Guide.
AutoDock 4 is Free Software
The introduction of AutoDock 4 comprises three major improvements:
- The docking results are more accurate and reliable.
- It can optionally model flexibility in the target macromolecule.
- It enables AutoDock's use in evaluating protein-protein interactions.
AutoDock 4.0 is not only faster than earlier versions, it also allows sidechains in the macromolecule to be flexible. As before, rigid docking is blindingly fast, and high-quality flexible docking can be done in around a minute. Up to 40,000 rigid dockings can be done in a day on one CPU.
AutoDock 4.0 now has a free-energy scoring function that is based on a linear regression analysis, the AMBER force field, and an even larger set of diverse protein-ligand complexes with known inhibition constants than we used in AutoDock 3.0. The best model was cross-validated with a separate set of HIV-1 protease complexes, and confirmed that the standard error is around 2.5 kcal/mol. This is enough to discriminate between leads with milli-, micro- and nano-molar inhibition constants.
You can read more details about the new features in AutoDock4.2 User Guide.
AutoDock 4.0 can be compiled to take advantage of new search methods from the optimization library, ACRO, developed by William E. Hart at Sandia National Labs. We have also added some new features to our existing evolutionary methods. We still provide the Monte Carlo simulated annealing (SA) method of 2.4 and earlier. The Lamarckian Genetic Algorithm (LGA) is a big improvement on the Genetic Algorithm, and both genetic methods are much more efficient and robust than SA.
Mailing List and Forum
We have established a mailing list and forum for AutoDock users. Here is more information about the AutoDock List (ADL). URL for the forum is http://mgl.scripps.edu/forum.
What is AutoDockTools (ADT)?
We have developed and continue to improve our graphical front-end for AutoDock and AutoGrid, ADT (AutoDockTools). It runs on Linux, Mac OS X, SGI IRIX and Microsoft Windows. We also have new tutorials, along with accompanying sample files.
Where is AutoDock Used?
AutoDock has now been distributed to more than 29000 users around the world. It is being used in academic, governmental, non-profit and commercial settings. In January of 2011, a search of the ISI Citation Index showed more than 2700 publications have cited the primary AutoDock methods papers.
AutoDock is now distributed under the GPL open source license and is freely available for all to use. Because of the restrictions of incorporating GPL licensed software into other codes for the purpose of redistribution, some companies may wish to license AutoDock under a separate license agreement - which we can arrange. Please contact Prof. Arthur J. Olson at + 1 (858) 784-2526 for more information.
Why Use AutoDock?
AutoDock has been widely used and there are many examples of its successful application in the literature (see References); in 2006, AutoDock was the most cited docking software. It is very fast, provides high quality predictions of ligand conformations, and good correlations between predicted inhibition constants and experimental ones. AutoDock has also been shown to be useful in blind docking, where the location of the binding site is not known. Plus, AutoDock is free software and version 4 is distributed under the GNU General Public License; it is easy to obtain, too.
Run Your AutoDock Research Project on World Community Grid!
Does your research run on AutoDock? If so, you may be eligible to benefit from World Community Grid's free computational power to accelerate your research. AutoDock has already been "grid-enabled" by World Community Grid's technical team and is run on World Community Grid with the following projects:
- FightAIDS@Home project from The Scripps Research Institute.
project from The University of Texas Medical Branch.
Please review World Community Grid's research project criteria and contact World Community Grid if you have an idea for a project proposal or any questions.
File format Converters
- . Free open source chemical expert system mainly used for converting chemical file formats. For Windows, Unix, and Mac OS.
. Generates 3D structures for small and medium sized, drug-like molecules. Distributed by Molecular Networks.
. Universal organic chemistry toolkit, containing tools for end users, as well as a documented API for developers. Free and open-source, but also available on a commercial basis. Distributed by GGA software.
. Command-line molecule and reaction rendering utility. Free and open source. Distributed by GGA software.
. Command-line canonical SMILES generator. Free and open source. Distributed by GGA software.
. Command-line program for R-Group deconvolution. Free and open source. Distributed by GGA software.
. (Conformer Ensembles Containing Bioactive Conformations). Converts from 1D or 2D to 3D using distance bounds methods, with a focus on reproducing the bioactive conformation. Developed by OpenEye.
. (COordinates of Small MOleculeS). High-throughput method to predict the 3D structure of small molecules from their 1D/2D representations. Also exists as a web service. Provided by the University of California, Irvine.
. Generate and analyse 3D conformers of small molecules. TorsionAnalyzer is based on an expert-derived collection of SMARTS patterns and rules (assigned peaks and tolerances). Rules result from statistical analysis of histograms derived from small molecule X-ray data extracted from the CSD. Rotatable bonds of molecules loaded into the TorsionAnalyzer are color-coded on the fly by means of a traffic light highlighting regular, borderline and unusual torsion angles. This allows the user to see at a glance if one or more torsion angles are out of the ordinary. Provided by BioSolveIT.
. 2D to 3D structure conversions, including tautomeric, stereochemical, and ionization variations, as well as energy minimization and flexible filters to generate ligand libraries that are optimized for further computational analyses. Distributed by Schrodinger.
. Universal scriptable toolkit for chemical information processing. Used by PubChem. Maintained and distributed by Xemistry. Free for academic.
. Indigo-based utility for finding duplications and visual comparison of two files containing multiple structures. SDF, SMILES, CML, MOLFILE input formats are supported. Files can contain a large number of molecules, and ChemDiff was tested on files with up to 1 million of them. Free and open-source. Distributed by GGA software.
. (Optical Structure Recognition Application). Utility designed to convert graphical representations of chemical structures, as they appear in journal articles, patent documents, textbooks, trade magazines etc. OSRA can read a document in any of the over 90 graphical formats parseable by ImageMagick - including GIF, JPEG, PNG, TIFF, PDF, PS etc., and generate the SMILES or SDF representation of the molecular structure images encountered within that document. Free and open source. Developed by the Frederick National Laboratory for Cancer Research, NIH.
. Collection of Perl scripts, modules, and classes to support day-to-day computational chemistry needs. Free software, open source. Provided by Manish Sud.
. Engine module of VLifeMDS containing basic molecular modeling capabilities such as building, viewing, editing, modifying, and optimizing small and large molecules. Fast conformer generation by systematic and Monte Carlo methods. Provided by VLife.
. (Structure PrOtonation and REcognition System). Structure recognition tool for automated protein and ligand preparation. SPORES generates connectivity, hybridisation, atom and bond types from the coordinates of the molecule's heavy atoms and adds hydrogen atoms to the structure. The protonation can either be done by just adding missing hydrogen atoms or as a complete reprotonation. SPORES is able to generate different protonation states, tautomers and stereoisomers for a given structure. Developed by the University of Konstanz.
. Program to generate 3D conformation of small molecules using Distance Geometry and Automated Molecular Mechanics Optimization for in silico Screening. Freely distributed by the University of Paris Diderot.
. Molecular modeling tool to convert 2D structures (chemical structural formula) of compounds drawn by ISIS-Draw or ChemDraw to 3D structures with additional information on atomic charge etc. Distributed by IMMD.
. A software suite for drawing chemical structure diagrams, including the ability to calculate NMR spectra, generate IUPAC names and line notations for structures, manipulate structures imported from the Internet, interpret and interconvert files generated by other chemical drawing software programs, illustrate glassware and equipment setups, and draw TLC plates. Distributed by iChemLabs LLC.
. Software for searching and analyzing the conformational space of small and large molecules.
. Cheminformatics library mainly used for conversion of file formats. Written in Java. For Windows, Unix, and Mac OS.
. LGPL-ed library for bio- and cheminformatics and computational chemistry written in Java. Opensource.
. .NET Cheminformatics Toolkit completely built on Microsoft .NET platform. By using Mono, MolEngine can run on other platform, such as Mac, Linux, iPad. Distributed by Scilligence.
. Universal organic chemistry toolkit. Free and opensource. Provided by GGA.
. Indigo-based utility for finding duplications and visual comparison of two files containing multiple structures. SDF, SMILES, CML, MOLFILE input formats are supported. Provided by GGA.
. ODDT is a free and open source tool for both computer aided drug discovery (CADD) developers and researchers. It reimplements many state-of-the-art methods, such as machine learning scoring functions (RF-Score and NNScore) and wraps other external software to ease the process of developing CADD pipelines. ODDT is an out-of-the-box solution designed to be easily customizable and extensible. Therefore, users are strongly encouraged to extend it and develop new methods. Provided by the Institute of Biochemistry and Biophysics PAS, Warsaw, Poland.
. Collection of cheminformatics and machine-learning software written in C++ and Python.
. Molecule file manipulation and conversion program.
. Molecule file manipulation and conversion program.
. KNOwledge-Driven Ligand Extractor is a software library for the recognition of atomic types, their hybridization states and bond orders in the structures of small molecules. Its prediction model is based on nonlinear Support Vector Machines. The process of bond and atom properties perception is divided into several steps. At the beginning, only information about the coordinates and elements for each atom is available: (i) Connectivity is recognized (ii) A search of rings is performed to find the Smallest Set of Smallest Rings (SSSR) (iii) Atomic hybridizations are predicted by the corresponding SVM model (iv) Bond orders are predicted by the corresponding SVM model (v) Aromatic cycles are found and (vi) Atomic types are set in obedience to the functional groups. Some bonds are reassigned during this stage. Linux and MacOS version are free of charge. Maintained by the Nano-D team, Inria/CNRS Grenoble, France.
. Consists of two programs that can be used to convert one or more SMILES strings to 3D. For Mac and Linux. Also exists as a web service.
. Java-based software tool for exploring the chemical space by enabling generation of and navigation in a scaffold tree hierarchy annotated with various data. The graphical visualization of structural relationships makes it possible to analyze large data sets, e.g., to correlate chemical structure and biochemical activity. Free open source software developed and supported by the Chair of Algorithm Engineering at Technical University Dortmund and the Department of Chemical Biology at Max-Planck Institute for Molecular Physiology Dortmund.
. Java-based program which generates the scaffold tree database independently of Scaffold Hunter. Free open source software developed and supported by the Chair of Algorithm Engineering at Technical University Dortmund and the Department of Chemical Biology at Max-Planck Institute for Molecular Physiology Dortmund.
. Program to extract scaffolds from organic drug-like molecules by 'stripping' away sidechains and representing the remaining structure in a condensed form. Open source software distributed by Silicos.
. Free and open source python script that can decompose PDBs of small-molecule compounds into their constituent fragments. Developed by the National Biomedical Computation Resource.
. Enumerates ligand protonation states and tautomers in biological conditions. Distributed by Schrodinger.
. iBabel is an alternative graphical interface to Open Babel for Macintosh OS X.
. Collection of perl modules providing objects and methods for representing molecules, atoms, and bonds in Perl doing substructure matching and reading and writing files in various formats.
. The purpose of this SDF toolkit is to provide functions to read and parse SDFs, filter, and add/remove properties.
Sequence identity is a simple yet reasonably accurate predictor of docking success
The analysis of all sequence- and structure-based indices showed that none performs significantly better than the others. Sequence identity and similarity perform equally well (correlation coefficients of 0.70 and 0.69, respectively) and are trivial to calculate, requiring no further information than the pairwise alignment. Interestingly, the sequence identity at the interface is only a marginally better predictor (correlation coefficient of 0.71), which suggests that the overall fold of the molecule is relevant for a good arrangement of the interface and thus for the success of the docking.
Structure-based indices show a rather heterogeneous performance. The QMean, 34 Molprobity, 44 and Verify3D 35 metrics all evaluate the structural properties of the model, such as amino acid packing, distribution of torsion angles, etc. (Supporting Information Table S1). Since the homology models undergo a slight refinement, it is not expected that they have severe clashes or other deviant structural features. Nevertheless, Molprobity was very discriminative of native structures, attributing to these very low scores (almost always below 15 a.u.) in contrast to scores above 70 for the majority of the homology models. The scoring between the homology models was, however, heterogeneous and did not correlate with the docking results. Finally, the backbone iRMSD between model and template, a direct structural comparison measure, showed the highest correlation coefficient, on par with TVSMod_RMSD (0.73), and better than the overall structural similarity between the two structures (0.56).
The quality of the interaction restraints has a greater impact than the quality of the homology model
Information-driven docking narrows the conformational landscape of association of the molecules to the fraction that respects that information. Furthermore, if the information is integrated in the energy function used in refinement (i.e., not only for scoring), there is an added benefit of driving the interface refinement. Our results are in agreement with these assumptions, since docking calculations using literature-based information [CAPRI restraints, Fig. 2(B,D)] show worse results than those using true interface restraints [Fig. 2(C,E)]. The impact of the quality of the restraints is illustrated in the runs of T18, where precision and recall values were extremely low and the models were accordingly of bad quality (iRMSD over 4 Å). Overall however, despite starting the modeling process with templates as low as 20% sequence identity, the docked models are still quite reasonable (within 3 Å iRMSD), provided that the interaction information is reliable. This thus stresses the importance of the quality of the data over that of the model. The scoring of the models, helped by the interface information, is also robust enough to discriminate good quality models, regardless of the identity of the template used in the homology modeling. This again reinforces the notion that the quality of the data is more important than that of the model, since good data can refine a bad model and discriminate which solutions are closer to the native structure, while weak data pollute the docking protocol even when the model quality is reasonable.
Defining the limits of homology modeling in information-driven docking
On the basis of these observations, we can predict the quality of information-driven docking predictions given the sequence identity of the templates used to build the homology models (Fig. 3). Assuming reliable interface information, a homology model built with a template sharing 20% sequence identity can be expected to produce docking models within 4 Å iRMSD of the native complex. As the target-template identity increases, so does the expected quality of the final models. For example, most of the 60% identity models produced docking solutions around 2 Å iRMSD. This is likely to represent an overestimate of the achievable quality since one of the docking partners was taken in its bound form. Still, it is striking to see that the recent CAPRI targets, which were all homology–homology or homology–unbound docking cases, nicely follow the trend line of our model. This would indicate that the achievable docking quality is limited by the lowest sequence identity component of the interaction partners—in other words: the worse approximation defines the limits of your model.
The reliability of the information is of course hard to estimate. During a CAPRI round, most of the information is gathered from literature databases and bioinformatics predictions in the 24-hour period that comprises the server submission. All in all, this essentially means that reliable information is not so scarce as one might imagine. Finally, the homology modeling approach used in this study is standard, not using advanced refinement methods such as those available in structure prediction servers. 17, 45 As such, the presented results can be considered a baseline, which can be further improved by expert knowledge of the system under study and/or more powerful structure prediction methods.
ASSESSMENT PROCEDURES AND CRITERIA
The standard CAPRI assessment protocol
The predicted homo and heterocomplexes were assessed by the CAPRI assessment team, using the standard CAPRI assessment protocol, which evaluates the correspondence between predicted complex and the target structure. 18, 19
This protocol (summarized in Fig. 1) first defines the set of residues common to all the submitted models and the target, so as to enable the comparison of residue-dependent quantities, such as the root mean square deviation (rmsd) of the models versus the target structure. Models where the sequence identity to the target is too low are not assessed. The threshold is determined on a per-target basis, but is typically set to 70%.
Schematic illustration of the CAPRI assessment criteria. The following quantities were computed for each target: (1) all the residue-residue contacts between the Receptor (R) and the Ligand (L), and (2) the residues contributing to the interface of each of the components of the complex. Interface residues were defined on the basis of their contribution to the interface area, as described in references. 18, 19 For each submitted model the following quantities were computed: the fractions f(nat) of native and f(non-nat) of non-native contacts in the predicted interface, the root mean square displacement (rmsd) of the backbone atoms of the ligand (L-rms), and the mis-orientation angle θL and the residual displacement dL of the ligand center of mass, after the receptor in the model and experimental structures were optimally superimposed. In addition we computed I-rms, the rmsd of the backbone atoms of all interface residues after they have been optimally superimposed. Here the interface residues were defined less stringently on the basis of residue-residue contacts (see Refs. 18, 19 ).
The set of common residues is used to evaluate the two main rmsd-based quantities used in the assessment: the ligand rmsd (L-rms) and the interface rmsd (I-rms). L-rms is the backbone rmsd over the common set of ligand residues after a structural superposition of the receptor. I-rms is the backbone rmsd calculated over the common set of interface residues after a structural superposition of these residues. An interface residue is defined as such when any of its atoms (hydrogens excluded) are found within 10 Å of any of the atoms of the binding partner.
An important third quantity whereby models are assessed is f(nat), the fraction of native contacts in the target that are reproduced in the model. This quantity takes all the protein residues into account. A ligand-receptor contact is defined as any pair of ligand-receptor atoms within 5 Å distance. Atomic contacts below 3 Å are considered clashes; predictions with too many clashes are disqualified. The clash threshold varies with the target and is defined as the average number of clashes in the set of predictions plus two standard deviations. The quantities f(nat), L-rms and I-rms together determine the quality of a predicted model, and based on those three parameters models are ranked into four categories: high quality, medium quality, acceptable quality and incorrect, as summarized in Table 3.
Table 3
|Score|Quality|f(nat)|L-rms (Å)| |I-rms (Å)|
|***|High|≥ 0.5|≤ 1.0|OR|≤ 1.0|
|**|Medium|≥ 0.3|> 1.0 and ≤ 5.0|OR|> 1.0 and ≤ 2.0|
|*|Acceptable|≥ 0.1|> 5.0 and ≤ 10.0|OR|> 2.0 and ≤ 4.0|
| |Incorrect|< 0.1|> 10.0|AND|> 4.0|
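The criteria above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the assessors' code: the atom record layout and function names are hypothetical, contacts are collapsed to residue pairs, the population standard deviation is used for the clash threshold (the text does not say which form applies), and the categories are tested from best to worst so that a model receives the highest category whose conditions it meets.

```python
import math
from statistics import mean, pstdev


def residue_contacts(receptor, ligand, cutoff=5.0):
    """Residue pairs with at least one receptor-ligand atom pair within `cutoff` Å.

    Each molecule is a list of (residue_id, (x, y, z)) atom records (hypothetical layout).
    """
    return {
        (r_res, l_res)
        for r_res, r_xyz in receptor
        for l_res, l_xyz in ligand
        if math.dist(r_xyz, l_xyz) <= cutoff
    }


def fraction_native(native, model):
    """f(nat): share of the target's contacts that also occur in the model."""
    if not native:
        raise ValueError("target has no contacts")
    return len(native & model) / len(native)


def clash_threshold(clash_counts):
    """Mean number of clashes over all predictions plus two standard deviations."""
    # Assumption: population standard deviation.
    return mean(clash_counts) + 2 * pstdev(clash_counts)


def capri_rank(f_nat, l_rms, i_rms):
    """Quality category from f(nat), L-rms (Å) and I-rms (Å), tested best to worst."""
    if f_nat >= 0.5 and (l_rms <= 1.0 or i_rms <= 1.0):
        return "high"
    if f_nat >= 0.3 and (l_rms <= 5.0 or i_rms <= 2.0):
        return "medium"
    if f_nat >= 0.1 and (l_rms <= 10.0 or i_rms <= 4.0):
        return "acceptable"
    return "incorrect"


receptor = [("R1", (0.0, 0.0, 0.0)), ("R2", (10.0, 0.0, 0.0))]
native_ligand = [("L1", (4.0, 0.0, 0.0)), ("L2", (13.0, 0.0, 0.0))]
model_ligand = [("L1", (4.0, 0.0, 0.0)), ("L2", (20.0, 0.0, 0.0))]
f_nat = fraction_native(
    residue_contacts(receptor, native_ligand),
    residue_contacts(receptor, model_ligand),
)
print(f_nat, capri_rank(f_nat, l_rms=3.0, i_rms=2.5))
```

In this toy case one of the two native residue contacts is recovered, so f(nat) is 0.5, and with an L-rms of 3.0 Å the model falls in the medium category.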
Applying the CAPRI assessment protocol to homo-oligomers
Evaluating models of homo- and heteroprotein complexes against the corresponding target structure is a well-defined problem when the target complex is unambiguously defined, for example, if the target association mode and corresponding interface represent the biologically relevant unit. This is usually, although not always, the case for binary heterocomplexes, but was not the situation encountered in this experiment for the homo-oligomer targets. All except two of the 25 targets for which predictions were evaluated here represent homo-oligomers. For about half of these targets the oligomeric state was deemed unreliable, as it was either only inferred computationally from the crystal structure using the PISA software 23 or because the authors' assignment and inferred oligomeric states, although available, were inconsistent (Table 1). Only about 15 targets had an oligomeric state assigned by the authors at the time of the experiment.
To address this problem in the assessment, the PISA software was used to generate all the crystal contacts for each target and to compute the corresponding interface areas. The interfaces were then ranked according to interface size. In candidate dimer targets, submitted models were usually evaluated against 1 or 2 of the largest interfaces of the target, and acceptable or better models for any or all of these interfaces were tallied. For candidate tetramer targets, the relevant largest interfaces for each target were identified in the crystal structure, and predicted models were evaluated by comparing in turn each pair of interacting subunits in the model to each of the relevant pairs of interacting subunits in the target (Supporting Information Fig. S1), and again the best predicted interfaces were retained for the tally. One of the two bona fide heterocomplexes was also evaluated against multiple interfaces.
Evaluating the accuracy of the 3D models of individual subunits
Since this experiment was a close collaboration between CAPRI and CASP, the quality of the 3D models of individual subunits in the predicted complexes was assessed by the CASP team using the LGA program, 35 which is the basic tool for model/target comparison in CASP. 36, 37 The tool can be run in two evaluation modes. In the sequence-dependent mode, the algorithm assumes that each residue in the model corresponds to a residue with the same number in the target, while in the sequence-independent mode this restriction is not applied. The program searches for optimal superimpositions between two structures at different distance cutoffs and returns two main accuracy scores, GDT_TS and LGA_S. The GDT_TS score is calculated in the sequence-dependent mode and represents the average percentage of residues that are in close proximity in two structures optimally superimposed using four selected distance cutoffs (see Ref. 38 for details). The LGA_S score is calculated in both evaluation modes and represents a weighted sum of the auxiliary LCS and GDT scores from the superimpositions built for the full set of distance cutoffs (see Ref. 35 for details). We ran the evaluation in both modes, but since the CAPRI submission format permits different residue numbering, we used the LGA_S score from the sequence-independent analysis as the main measure of subunit accuracy. This score is expressed on a scale from 0 to 100, with 100 representing a model that perfectly fits the target. The rmsd values for subunit models cited throughout the text are those computed by the LGA software. We verified that for about 80% of the assessed models the GDT_TS and LGA_S scores differed by <15 units, indicating that these models correspond to near identical structural alignments with the corresponding targets, in line with the fact that the majority of the targets of this Round represent proteins that could be readily modeled by homology.
Of the remaining 20% with larger differences between the 2 scores, 18% correspond to disqualified models or incorrect complexes and 2% correspond to acceptable (or higher quality) predicted complexes. Their impact on the analysis is therefore negligible.
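To make the GDT_TS definition concrete, here is a simplified Python sketch. It assumes per-residue distances (in Å) between corresponding atoms from a single, already computed superposition, and it assumes the conventional cutoffs of 1, 2, 4 and 8 Å; the text itself only mentions four selected distance cutoffs and defers to Ref. 38. LGA searches for an optimal superposition at each cutoff, which this sketch does not attempt, so for the same residue correspondence its values will generally not exceed those LGA would report. The function name is hypothetical.

```python
def gdt_ts_from_distances(distances, n_target_residues=None, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of residues within each distance cutoff (0 to 100).

    distances: per-residue model-to-target distances in Å for the aligned residues.
    n_target_residues: normalisation length; defaults to the number of distances.
    """
    n = n_target_residues if n_target_residues is not None else len(distances)
    if n <= 0:
        raise ValueError("need at least one residue")
    # Count residues under every cutoff, then average the four percentages.
    within = sum(sum(1 for d in distances if d <= c) for c in cutoffs)
    return 100.0 * within / (n * len(cutoffs))


print(gdt_ts_from_distances([0.5, 1.5, 3.0, 6.0, 10.0]))
```

Passing `n_target_residues` larger than the number of aligned residues penalises models that cover only part of the target.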
Building target models based on the best available templates
In order to better estimate the added value of protein docking procedures and template-based modeling techniques it seemed of interest to build a baseline against which the different approaches could be benchmarked. To this end, the best oligomeric structure template for each target available at the time of the predictions was identified. Based on this template, the target model was built using a standard modeling procedure, and the quality of this model was assessed using the CAPRI evaluation criteria described above.
To identify the templates, the protein structure database “PDB70” containing proteins of mutual sequence identity ≤70% was downloaded from HHsuite. 39 The database was updated twice during the experiment (See Supporting Information Table S5 for the release date of the database used for each target). Only homo-complexes were considered for this analysis.
The best available templates were detected in three different ways, and target models were generated from the templates as follows:
(1) Detection based on sequence information alone: For each target sequence, proteins related to the target were searched for in the protein structure database by HHsearch 40 in the local alignment mode with the Viterbi algorithm. 41 Among the top 100 entries, up to 10 proteins that are in the desired oligomer state were selected as templates. When more than two assembly structures with different interfaces were identified, the best ranking one was selected as template. The target and template sequences were aligned using HHalign 40 in the global alignment mode with the maximum accuracy algorithm. Based on the sequence alignments, oligomer models were built using MODELLER. 42 The model with the lowest MODELLER energy out of 10 models was selected for further analysis.
(2) Detection based on the experimental monomer structure: Proteins with highest structural similarity to the experimental monomer structure were searched for using TM-align. 43 Among the top 100 entries, up to 10 proteins that are in the desired oligomer state were selected as templates as described above. Based on the target-template alignments output by TM-align, models were built using MODELLER, and the lowest energy model was selected as described above.
(3) Detection based on the experimental oligomer structure: A procedure similar to those described above was applied, except that the best templates were identified by searching for proteins with the highest structural similarity to the target oligomer structure. The search was performed using the multimeric structure alignment tool MM-align. 44 For computational efficiency, MM-align was applied only to the 100 proteins with the highest monomer structure similarity to the target. Models were built using MODELLER based on the alignment output by MM-align.
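The selection logic shared by the three detection schemes (take the top 100 hits, keep up to 10 in the desired oligomer state, then pick the lowest-energy model) can be sketched as below. The `Hit` record, the field names and the assumption that a higher search score means a better hit are all hypothetical; the sketch does not call HHsearch, TM-align, MM-align or MODELLER and only mimics the filtering applied to their outputs.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database hit; fields are illustrative, not an actual tool output format."""
    pdb_id: str
    score: float          # assumed: higher means more similar to the target
    oligomer_state: int   # number of subunits in the assembly


def select_templates(hits, desired_state, top_n=100, max_templates=10):
    """Keep up to `max_templates` hits in the desired oligomer state among the top `top_n`."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_n]
    return [h for h in ranked if h.oligomer_state == desired_state][:max_templates]


def pick_lowest_energy(models):
    """Choose the (name, energy) pair with the lowest energy."""
    return min(models, key=lambda m: m[1])


hits = [Hit("aaaa", 90.0, 2), Hit("bbbb", 80.0, 4), Hit("cccc", 70.0, 2)]
templates = select_templates(hits, desired_state=2)
best_model = pick_lowest_energy([("model_1", -100.0), ("model_2", -150.0)])
print([t.pdb_id for t in templates], best_model)
```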
Why InterEvDock2?
The structural modelling of protein-protein interactions is key in understanding how cell machineries assemble and cross-talk with each other. When homologous sequences are available for both protein partners, it is very useful to rely on structures and multiple sequence alignments to identify binding interfaces. InterEvDock2 is a server for protein docking running the InterEvScore potential, specifically designed to integrate evolutionary information in the docking process. The InterEvScore potential was developed for heteromeric protein interfaces and combines a residue-based multi-body statistical potential with evolutionary information derived from the multiple sequence alignments of each partner in the complex. In the InterEvDock2 server, the systematic docking search is performed using the FRODOCK2 program, and the resulting models are re-scored with InterEvScore together with the SOAP_PP atom-based statistical potential, which was found to increase the confidence of the predictions.
InterEvDock2 is an update of InterEvDock that can handle protein sequences as inputs, and not only protein 3D structures. When a sequence is provided by the user, a comparative modeling step based on an automatic template search protocol builds models for the individual protein partners prior to docking. If the user has biological information, such as a position known to be involved in the interface between the two protein partners, constraints can be specified for use in the docking procedure. This can be crucial to ensure that all available biologically relevant information is used for InterEvDock2 predictions. In addition, InterEvDock2 makes it possible to submit structures of oligomers as input to the free docking. Such an option is generally complicated in co-evolution analyses since the joint MSAs have to be generated for every chain of an oligomer. This process is now fully automated in InterEvDock2.
When using this service, please cite the following references:
Please also cite the FRODOCK2 program, which is used for the rigid-body docking step:
Using the results of SOAP_PP, please cite:
Using the evolutionary conservation results obtained using Rate4Site (mapped onto all visualized models in the PV applet and written into the b-factor field of the PDB files provided for all models in the results zip archive) please cite:
Using the comparative modeling protocol based on RosettaCM (i.e. if your input consists of one or two sequences), please cite:
Using the automatic template search (i.e. if your input consists of one or two sequences and you did not specify a template), please cite:
When Software Eats Bio
Vijay Pande first joined a16z as our first professor in residence from Stanford University, where he was a professor of chemistry, computer science, and structural biology; directed the Program in Biophysics; and ran an award-winning distributed computing lab whose work contributed to our understanding of Alzheimer’s, Huntington’s, and various kinds of cancer. Pande also co-founded Globavir BioSciences; was an early developer at a video game company that was sold to Sony; and is an advisor to numerous IT and bio startups.
a16z: This seems so obvious, but why a bio fund?
Vijay: There’s a couple different ways I think about this. One is that we all care about human health — whether it’s for ourselves, our parents, our children — and it’s a big deal on a very deep, fundamental level, in terms of thinking about the meaning of life. At a much more mundane business level, there’s clearly a huge market opportunity here. Just think of the marketing budget that Google can go after (with ads and such) — $200 billion. But compared to that, the U.S. healthcare budget is $2 trillion! Even tiny little sub-budgets of that are huge markets for startups to go after.
Bio is therefore an area where there’s a real chance to change the world … but also a chance for really great financial returns as well. The firm has been excited about this space for a while, and we’ve made investments before even raising a bio fund. But we want to do something really big, and expect this to grow bigger in time, so establishing a separate fund is also about our thinking years down the road.
a16z: So why now? Areas like healthcare (and education, among a few others) have been impenetrable to disruption, despite periodic claims to the contrary. Such hyperinflated industries have always been ripe for tech, yet they’ve never really been remade with tech. I feel like we keep saying ‘this time is different’.
Vijay: There IS a specific confluence of trends right now. On the computational side things are fundamentally different. Even though Moore’s law made Silicon Valley, we still can’t conceive of how exponentially the cost of computation is going down.
One of my big projects at Stanford, Folding@home, got a Guinness World Record for the most powerful supercomputer in the world; it was the first to reach 1 petaflop. But now, that amount of compute power costs $400 a day on Amazon. That sort of “exponential decay” results in declining cost, making what used to be extraordinary and world-record making both average and cheap today.
That’s what’s happening in compute. But there’s also a Moore’s law for storage that’s been exponentially decreasing as well. When you combine this “free” compute and storage and data, you get big data — which machine learning depends on — which in turn leads to deep learning.
a16z: So how do we connect these dots to bio?
Vijay: Bio has its own Moore’s law. Because the cost of sensors are going to zero, the cost of things like genomic sequencing are going to zero. Actually, they’re going to zero faster than with Moore’s law.
The Human Genome Project was set up in 1991 and finished in 2001, for something like $3 billion. Now, it would cost $300. That’s a clear exponential decay in cost. It creates an interesting situation where so much is available to us right now. What’s left is the software to put it all together.
a16z: How can you make the claim that software connects the dots? Because when I think of bio, I think of tissue and flesh; I don’t think of computation and algorithms. How do those two actually come together?
Vijay: Let’s take machine learning. You can now do so much with image recognition there. And a big part of medicine involves images. Sure, when you go to your doctor, a bit of listening happens, but most of it is really about analyzing your x-rays (radiology), examining your skin (dermatology), or looking at your eyes (ophthalmology).
Of course, these doctors aren’t just using their eyes; they’re applying and honing decades of medical training to do the pattern recognition, which in many cases is very subtle and requires significant expertise. There’s going to be many examples like this where computation can do something beyond what a human being can. It’s not limited to just vision. Think of all the inputs that humans take in with their senses; each of those is amenable to machine learning and deep learning: Listening with a stethoscope. Smelling something. And so on.
In many cases, algorithms can do better than humans. Just as computer vision has had a huge impact in non-medical areas, it’s now getting to the point where it can set a new gold standard. If the gold standard in radiology is to predict what radiologists would do, computers can go beyond that. In radiation oncology for example the gold standard would be to predict the biopsy results … without having to actually put the patient through one.
a16z: What you’re describing is essentially disintermediating doctors, isn’t it? What are the implications of that, more broadly?
Vijay: I don’t think the goal here is to take people entirely out of the equation. It’s to help the experts.
Imagine a computer algorithm that does the equivalent of what spellchecking does for writers. Similarly, instead of radiologists having to look at thousands of images, the computer vision algorithm flags only the important ones. Just as with a spellchecker. And maybe you say, wait, that’s not a typo; it’s actually someone’s name. But the ultimate judgment is for the human to make.
What I’m describing doesn’t replace all radiologists and other medical specialists; it just dramatically speeds up their work and allows them to concentrate on higher-order, more complex, more important things.
a16z: It’s not just about being cheaper and faster, but better.
Vijay: Yes, and what I’ve just described is actually one of the three big areas we are focusing on with this bio fund — “computational biomedicine”.
Because for anything that’s machine learning-based — like image recognition and computer vision as with these examples — the machine learning gets better as the cost of compute and cost of storage goes to zero. But what machine learning really craves is data. And the reason machine learning and medicine is a marriage made in heaven is that medicine has a ton of data. All of which can now be stored, brought to algorithms, and related to later outcomes.
We can even learn new things as a result. It’s amazing: We recently discovered a new piece of human anatomy due to more precise microscopes. I was shocked; I thought anatomy is one of those areas we actually had locked down!
So I think taking this data-driven, computational approach to medicine will open up lots and lots of opportunities not just to improve the accuracy and quality of medicine, but to build really big companies as a result.
a16z: This kind of machine learning and big data requires compute and storage. Does this mean we are finally at an AWS-like infrastructure moment for bio startups, much like what happened for web-based startups?
Vijay: It’s an area that we’ve called “cloud biology” or “cloud bio”. Even the name is meant to evoke cloud computing, and all the new businesses that the cloud enables.
But what’s happening here is that real-life, real-world experiments can be done in a cloud-like fashion.
a16z: In a “cloud like fashion” — what does that even mean?
Vijay: Why is the cloud so important to startups? A startup in the software space would have had to spend $10-$20 million to build up a server farm, just to be able to do anything at scale in 2000. And scale is incredibly important because you can’t really prove your product by only running on one or two machines. Cloud computing meant that you could later give a startup $2-$3 million, and before they came back for their next series A investment, they would have a product out there, running, with customers.
You can de-risk early. And that’s a fundamental difference between bio and traditional biotech, where you often had to put in $100 million and then wait five years before there was any sort of signal for whether it was working or not. We can now give computer science grad students or MDs $2-3 million, and they can use cloud bio resources instead of having to build out the lab (which is the analog to building out a server farm).
a16z: Is all this just about achieving product-market fit faster? Or can we do more as a result of cloud computing applied to bio?
Vijay: While cloud computing leads to lower CapEx and often lower operating costs, what’s nice about AWS or other cloud compute services is that if you want to spin up 10,000 cores for five minutes and spin it back down, you can do that. And so with these new cloud bio resources you can spin up experiments, whether it’s in vitro experiments driven by robots, or animal experiments.
So no, it’s not just about cost efficiencies and a more efficient pathway to market fit. You can now also do things that you couldn’t do before. The elasticity the cloud provides to bio is key.
a16z: “Spin up experiments.” I love that turn of phrase. Besides being able to do that, how does cloud bio touch on the issue of reproducibility and accuracy in scientific research? I feel like we’re suddenly seeing a lot more about this lately, even though the problem has been around for ages; how does this issue fit in this context?
Vijay: I think we’re seeing a transformation right now, sort of like an Industrial Revolution for biology. If you look at how the state-of-the-art in biology has always been done, it reminds me of something almost pre-Industrial Revolution. It’s rows and rows of people working with their hands at benches, and in an apprenticeship-like way under a master biologist (often a professor).
It’s very difficult to achieve reproducibility of scientific results (which is important for advancing the field and deciding what research paths to pursue) in this context. I mean, even how you pipette can have a huge impact when you’re putting reagents in test tubes! Just two weeks ago, I heard a story where what the grad student ate for lunch changed the results. (Tuna fish put amines on his breath and therefore into the reagents; it was something that was very difficult to track.) There’s other stories like that out there too, like laundry dyes in a lab coat, and so on.
a16z: Ok, so all kinds of spurious variables can come in because of this mechanical fallibility. But I still don’t quite get how the computational aspect helps address the problem.
Vijay: So what happens in cloud biology is not purely the computational part, but the fact that computation is driving robots who can do the experiment.
At one of the more exciting companies I’ve seen in this space, when you want to do an experiment, you literally write a computer program. When we say that biology is becoming a software problem, in this case it’s quite literally the case. If you or anyone else wants to reproduce the experiment, you just have to get a copy of the program and rerun it.
a16z: So doing an experiment means running a computer program. Isn’t this what computer simulations and modeling already do for us?
Vijay: These are real-life experiments. Simulations always have to make a trade-off for compute cost versus accuracy; that’s the main issue. This is the real thing.
a16z: Ok, you’ve now described two areas we’re focusing on so far with the bio fund — “cloud bio” just now and “computational biomedicine” earlier. Anything else?
Vijay: The other huge area of interest for us is “digital therapeutics”. It’s a term pioneered by Omada [which we’re investors in] and others.
The way I like to think of it is this: If the first phase of medicine was about small-molecule drugs delivered intravenously, and the second phase (then led by biotech companies like Genentech) was about protein biologics, then the third phase is about digital therapeutics.
It seems like the holy grail of medicine is to take a pill, wait a bit, and then get better — just like magic! But there are real limits to this, especially when it comes to depression, PTSD, smoking cessation, type II diabetes, insomnia, and other behavior-mediated conditions.
I’m confident that 10-20 years from now when we look back on this phase of medicine, it’s going to seem backwards and even barbaric that our solution to everything was just giving out pills.
a16z: Are you just describing preventing disease in the first place (vs. treating it after the fact)? Or are you talking about changing habits? What does a digital therapeutic actually do?
Vijay: It very much changes habits. Digital therapeutics treat what are really behavioral problems with a behavioral solution.
To give you an example of what a digital therapeutic would actually do: Let’s say I’m borderline for type II diabetes. I could pay someone $100,000-$200,000 a year to follow me around 24/7, like a personal trainer, making me do pushups to build muscle mass and knocking doughnuts out of my hands every time I reach for one. And sure, that would work. It’s just really expensive for most of us. Behavioral therapies essentially do the equivalent type of motivation and coordination — and still have a human-touch element through coaches, messaging, social networks — but do so in a way that can scale such that costs are dramatically lower.
Because there are existing approaches that have already shown quite good efficacy in this space — they’re just expensive and don’t scale. A great example is Stanford’s sleep clinic or its pediatric obesity clinic, both of which do amazing things but cost a lot and can only take in a small number of (often privileged) people a year. Yet there are millions of people with type II diabetes …it’s an epidemic.
Digital therapeutics allow such successful approaches to become cheaper and to scale. And they have no toxic side effects, which is very appealing from a drug point of view; what we don’t like about investing in traditional biotech is the risks due to side effects, additional regulatory issues, and so on.
a16z: You’ve brought up a few times now how what we’re doing with the bio fund is so different from traditional ‘biotech’. Why? How?
Vijay: The bio fund is really about funding software companies in the bio space. Whereas traditional biotech has very little software in it, at its core.
I began by talking about Moore’s law. There is an analogous law on the drug design side, Eroom’s Law, with “Eroom” being “Moore” spelled backwards. Where Moore’s law is about the exponential decrease in cost, Eroom’s Law is about exponential increase in cost. And over the last four decades, drugs have been exponentially increasing in cost.
In terms of our investment thesis, when we said we’re not going to do biotech, we basically said we’re not going to do anything associated with Eroom’s. And we’re still saying that.
a16z: What makes something Moore’s (vs. Eroom’s) Law? How can you tell?
Vijay: Anything that is driven by the declining cost of computation.
Earlier, I mentioned how cloud bio is one of the big differences between traditional biotech and what we think of as bio. So something that’s heavily computer-driven and software-driven will go on the Moore’s law curve.
a16z: Is a natural consequence of this difference between traditional biotech and bio that the ideal entrepreneur for us is not a medical student?
Vijay: No, I don’t think that’s true.
One of the most exciting things about the opportunity right now is that med students 20 years ago were very different than they are now. Today, they’re very computer-savvy; some have been programming since they were teenagers. (At Stanford, roughly 80% of students take a programming class.)
Even if they don’t code, these students with medical degrees, these students of biology and chemistry, can talk in a very deep way about computer science. They may not be the CTO, but they have a tech-savvy mindset.
a16z: Flipping the previous question around for a moment, how does it apply to you? You weren’t trained as an MD, right? Does that mean you view bio companies differently than a doctor would?
Vijay: Most current doctors haven’t been immersed in machine learning let alone been researchers. The perspective I’m coming from works because this new wave of bio companies is far more software-like than bio-like — though there is of course bio at the core, too.
When I’m talking to entrepreneurs I like going deep with them not just on protein biology but on machine learning, distributed systems, infrastructure — or even just general issues with healthcare and medicine. These are all things that are very familiar to me and that I have done either as a startup founder or in my 15 years at Stanford and Stanford Medical School.
a16z: Going back to the beginning here — my point about healthcare being resistant to disruption — why can’t the incumbents do some of this if conditions are so different now? They certainly have the data. They know the space inside out. Shouldn’t they have a home-court advantage? This is not a case, as with Google back in the day, where a little startup is coming into a space where no entrenched alternatives existed before. This is a deeply entrenched industry with tentacles everywhere.
Vijay: And yet there are plenty of examples where incumbents just didn’t have it in their cultural corporate DNA to do that something. You could argue that IBM had everything necessary to build a social network. Even Google, once a startup itself, built a social network that didn’t compete nearly as well with Facebook.
It’s about different corporate cultures, different styles, and completely different operations. And bio x computer science is actually something very different. It would be very expensive and difficult for hospitals to do since they don’t have the infrastructure startups have; it would be like reinventing the wheel. Not to mention the cultural clash in absorbing the results even if they did get that far.
The existing pharma companies and other incumbents are very good at what they do. But from a tech point of view, healthcare companies and institutions are living in the 1980s — it’s “fax machine medicine”. Just as Google transformed multiple industries, or Uber and Lyft are changing the taxi and car industry, there are similar opportunities here due to tech at the core.
a16z: Clearly the culture and the operations matter. Speaking of, you mentioned the reasons why we’re doing a separate bio fund earlier. But how will that work, logistically?
Vijay: If you’re looking from the outside, from the entrepreneur’s point of view, I don’t think you’d even be able to tell that there is a separate fund in terms of how the firm operates. The separate fund simply emphasizes our dedication and excitement about the space, drawn from LPs who are committed to the vision as well.
But in terms of how pitches and everything we do would go, it’s all the same. We’ve got the full team involved here. Besides myself, there are a number of other people vetting these deals. And then all the other general partners are fully involved too — whether they’re weighing in on unit economics or the marketplace aspects or machine learning or cloud infrastructure or whatnot. The other general partners have a huge set of domain expertise and other experience to contribute, just as they do with all our other companies.
And it’s not all technical expertise, either. It’s taking advantage of their — and our operating teams’ — core competencies in hiring, sales, marketing, SaaS (which is very different from consumer applications), and so on. It’s about building enterprise sales, pricing strategies, etc.
That’s the most exciting thing here, business-wise. Because the reality is that these bio startups look exactly like software companies, especially after they achieve product-market fit and gain their first customers. When I said these companies are more like software companies, I meant it — not just in their tech core but in how to build and scale them.
a16z Podcast: Bio Meets Computer Science with Marc Andreessen, Chris Dixon, Vijay Pande
WFReDoW: A Cloud-Based Web Environment to Handle Molecular Docking Simulations of a Fully Flexible Receptor Model
1 Laboratório de Bioinformática, Modelagem e Simulação de Biossistemas (LABIO), Faculdade de Informática (FACIN), Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Avenida Ipiranga 6681, Prédio 32, Sala 608, 90619-900 Porto Alegre, RS, Brazil
2 Grupo de Pesquisa em Inteligência de Negócio (GPIN), Faculdade de Informática (FACIN), Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Avenida Ipiranga 6681, Prédio 32, Sala 628, 90619-900 Porto Alegre, RS, Brazil
Molecular docking simulations of fully flexible protein receptor (FFR) models are coming of age. In our studies, an FFR model is represented by a series of different conformations derived from a molecular dynamics simulation trajectory of the receptor. For each conformation in the FFR model, a docking simulation is executed and analyzed. An important challenge is to perform virtual screening of millions of ligands using an FFR model in a sequential mode since it can become computationally very demanding. In this paper, we propose a cloud-based web environment, called web Flexible Receptor Docking Workflow (wFReDoW), which reduces the CPU time in the molecular docking simulations of FFR models to small molecules. It is based on the new workflow data pattern called self-adaptive multiple instances (P-SaMIs) and on a middleware built on Amazon EC2 instances. P-SaMI reduces the number of molecular docking simulations while the middleware speeds up the docking experiments using a High Performance Computing (HPC) environment on the cloud. The experimental results show a reduction in the total elapsed time of docking experiments and the quality of the new reduced receptor models produced by discarding the nonpromising conformations from an FFR model ruled by the P-SaMI data pattern.
Large-scale scientific experiments have an ever-increasing demand for high performance computing (HPC) resources. This typical scenario is found in bioinformatics, which needs to perform computer modeling and simulations on data varying from DNA sequence to protein structure to protein-ligand interactions . The data flood, generated by these bioinformatics experiments, implies that technological breakthroughs are paramount to process an interactive sequence of tasks, software, or services in a timely fashion.
Rational drug design (RDD) constitutes one of the earliest medical applications of bioinformatics. RDD aims to transform biologically active compounds into suitable drugs. In silico molecular docking simulation is one of the main steps of RDD. It is used for compound discovery, typically by computationally screening a large database of organic molecules for putative ligands that fit into a binding site of the target molecule or receptor (usually a protein). The best ligand orientation and conformation inside the binding pocket is computed in terms of the free energy of binding (FEB) by software such as AutoDock4.2.
In order to mimic the natural, in vitro and in vivo, behavior of ligands and receptors, their plasticity or flexibility should be treated in an explicit manner : our receptor is a protein that is an inherently flexible system. However, the majority of molecular docking methods treat the ligands as flexible and the receptors as rigid bodies . In this study we model the explicit flexibility of a receptor by using an ensemble of conformations or snapshots derived from its molecular dynamics (MD) simulations  (reviewed by ). The resulting model receptor is called a fully-flexible receptor (FFR) model. Thus, for each conformation in the FFR model, a docking simulation is executed and analyzed .
Organizing and handling the execution and analysis of molecular docking simulations of FFR models and flexible ligands are not trivial tasks. The dimension of the FFR model can become a limiting factor because instead of performing docking simulations in a single, rigid receptor conformation, we must carry out this task for all conformations that make up the FFR model . These conformations can vary in number from thousands to millions. Therefore, the high computing costs involved in using FFR models to perform practical virtual screening of thousands or millions of ligands may make it unfeasible. For this reason, we have been developing methods to simplify or reduce the FFR model dimensionality [6, 9, 10]. We named this simpler representation of an FFR model a reduced fully flexible receptor (RFFR) model. An RFFR model is achieved by eliminating redundancy in the FFR model through clustering its set of conformations, thus generating subgroups, which should contain the most promising conformations .
To address these key issues, we propose a cloud-based web environment, called web Flexible Receptor Docking Workflow (wFReDoW), to quickly handle the molecular docking simulations of FFR models. To the best of our knowledge, it is the first docking web environment that reduces both the dimensionality of FFR models and the overall docking execution time using an HPC environment on the cloud. The wFReDoW architecture contains two main layers: Server Controller and the flexible receptor middleware (FReMI). Server Controller is a web server that prepares docking input files and reduces the size of the FFR model by means of the self-adaptive multiple instances (P-SaMI) data pattern. FReMI handles molecular docking simulations of FFR models integrated with an HPC environment on Amazon EC2 resources.
There are a number of approaches that predict ligand-receptor interactions on HPC environments using AutoDock4.2 . Most of them use the number of ligands to distribute the tasks among the processors. For instance, DOVIS 2.0  uses a dedicated HPC Linux cluster to execute virtual screening where ligands are uniformly distributed on each CPU. VSDocker 2.0  and Mola  are other examples of such systems. Whilst VSDocker 2.0 works on multiprocessor computing clusters and multiprocessor workstations operated by a Windows HPC Server, Mola uses AutoDock4.2 and AutoDock Vina to execute the virtual screening of small molecules on nondedicated compute clusters. Autodock4.lga.MPI  and mpAD4  use another approach to enhance the performance. As well as the docking parallel execution, Autodock4.lga.MPI and mpAD4 reduce the quantity of network I/O traffic during the loading of grid maps at the beginning of each docking simulation. Another approach is the AutoDockCloud . This is a high-throughput screening of parallel docking tasks that uses the open source Hadoop framework implementing the MapReduce paradigm for distributed computing on a cloud platform using AutoDock4.2 . Although every one of these environments reduces the overall elapsed time of the molecular docking simulations, they only perform docking experiments with rigid receptors. Conversely, wFReDoW applies new computational techniques [6, 10, 11, 18] to reduce the CPU time in the molecular docking simulations of FFR models using public databases of small molecules, such as ZINC .
In this work we present the wFReDoW architecture and its execution. From the wFReDoW executions we expect to find better ways to reduce the total elapsed time in the molecular docking simulations of FFR models. We assess the gains in performance and the quality of the results produced by wFReDoW using a small FFR model clustered by data mining techniques, a ligand from ZINC database , different P-SaMI parameters , and an HPC environment built on Amazon EC2 . Thus, from the best results obtained, we expect that future molecular docking experiments, with different ligands and new FFR models, will use only the conformations that are significantly more promising  in a minimum length of time.
2.1. The Docking Experiments with an FFR Model
To perform molecular docking simulations we need a receptor model, a ligand, and docking software. We used as receptor the enzyme 2-trans-enoyl-ACP (CoA) reductase (EC 1.3.1.9), known as InhA, from Mycobacterium tuberculosis. The FFR model of InhA was obtained from a 3,100 ps (1 picosecond = 10^-12 second) MD simulation described in , thus making an FFR model with 3,100 conformations or snapshots. In this study, for each snapshot in the FFR model, a docking simulation is executed and analyzed. Figure 1 illustrates the receptor flexibility.
The ligand triclosan (TCL400 from PDB ID: 1P45A)  was docked to the FFR model. We chose TCL from the referred crystal structure because it is one of the simplest inhibitors cocrystallized with the InhA enzyme. Figure 2 illustrates the reference position of the TCL400 ligand into its binding site (PDB ID: 1P45A) and the position of the TCL ligand after an FFR InhA-TCL molecular docking simulation.
For docking simulations, we used the AutoDock Tools (ADT) and AutoDock4.2 software packages . Input coordinate files for ligand and the FFR model of InhA were prepared with ADT as follows.
Receptor preparation. A PDBQT file for each snapshot from the FFR model was generated employing Kollman partial atomic charges for each atom type.
Flexible ligand preparation. The TCL ligand was initially positioned in the region close to its protein binding pocket and allowed two rotatable bonds.
Reference ligand preparation. This is the ideal position and orientation of the ligand that is expected from docking simulations. A TCL reference ligand was also prepared using the coordinates of the experimental structure (PDB ID: 1P45A). It is called the reference ligand position.
Grid preparation. For each snapshot a grid parameter file (GPF) was produced with box dimensions of
. The other parameters maintained the default values.
Docking parameters. Twenty-five Lamarckian genetic algorithm (LGA) independent runs were executed for each docking simulation. The LGA search method and parameters were: a population size of 150 individuals, a maximum of 250,000 energy evaluations and 27,000 generations. The other docking parameters were kept at default values.
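The search settings listed above can be written down as a small configuration sketch. The keyword names (ga_pop_size, ga_num_evals, ga_num_generations, ga_run) are those of the AutoDock 4 docking parameter file format; the helper name render_lga_lines is hypothetical, and the sketch covers only the four values stated in the text, not a complete parameter file.

```python
# Search settings reported in the text, keyed by AutoDock 4 DPF keywords.
LGA_SETTINGS = {
    "ga_pop_size": 150,           # individuals per population
    "ga_num_evals": 250000,       # maximum number of energy evaluations
    "ga_num_generations": 27000,  # maximum number of generations
    "ga_run": 25,                 # independent LGA runs per docking
}


def render_lga_lines(settings):
    """Return the settings as 'keyword value' lines, one per entry."""
    return [f"{key} {value}" for key, value in settings.items()]


if __name__ == "__main__":
    print("\n".join(render_lga_lines(LGA_SETTINGS)))
```

Keeping the values in one mapping makes it easy to reuse the same settings for every snapshot of the FFR model.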
2.2. Reducing the Fully Flexible Receptor Model
The snapshots of the FFR model used in this study are derived from an MD simulation trajectory of the receptor. Even though this approach is considered the best to mimic the natural behavior of ligands and receptors , its dimension or size may become a limiting factor. Moreover, the high computing cost involved could also make the practical virtual screening of such receptor models unfeasible. For these reasons, new methods have been developed to assist in the simplification or reduction of an FFR model to an RFFR model. The primary rationale of this approach is to eliminate redundancy in the FFR model through clustering of its constituent conformations . This is followed by the generation of subgroups with the most promising conformations via the P-SaMI data pattern .
2.2.1. Clusters of Snapshots from an FFR Model
The clusters of snapshots used in this study were generated using clustering algorithms with different similarity functions developed by [6, 7]. Basically, in this approach, our FFR model was used to find patterns that define clusters of snapshots with similar features. In this sense, if a snapshot is associated with a docking with significantly negative FEB, for a unique ligand, it is possible that this snapshot will interact favorably with structurally similar ligands . As a consequence, the clusters of snapshots, which were related to different classes of FEB values, are postprocessed using the P-SaMI data pattern to select the receptor conformations and, thus, to reduce the complexity of the FFR model.
2.2.2. P-SaMI Data Pattern for Scientific Workflow
P-SaMI is the acronym for pattern-self-adaptive multiple instances, a data pattern for scientific workflows developed by . The purpose of this approach is to define a data pattern that is able to dynamically select the most promising conformations from clusters of similar snapshots. As shown in Figure 3, the first P-SaMI step is to capture a clustering of snapshots from . Next, P-SaMI divides each cluster into subgroups of snapshots to progressively execute autogrid4 and autodock4 for each conformation that makes up the FFR model using an HPC environment. The results (docking results) are the best FEB value for each docked snapshot. From these results, P-SaMI uses previous FEB results (evaluation criteria) to determine the status and priority of the subgroups of snapshots. Status denotes whether a subgroup of snapshots is active (A), finalized (F), discarded (D), or with changed priority (P). Priority indicates how promising the snapshots belonging to that subgroup are, on a scale of 1 to 3 (1 being the most promising). Thus, if the docking results of a subgroup present an acceptable value of FEB, that subgroup is credited with a high priority. Otherwise, the subgroup has its priority reduced, or its status is changed to “D” and it is discarded, unless all the snapshots of that subgroup have already been processed (status “F”).
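The status and priority bookkeeping described above can be illustrated with a minimal sketch. The text does not give the actual evaluation criteria, so the two FEB cut-offs, the Subgroup class, and the update_subgroup function are all hypothetical; only the status letters and the 1 to 3 priority scale come from the text.

```python
from dataclasses import dataclass

# Hypothetical cut-offs in kcal/mol; the source does not state the real criteria.
GOOD_FEB = -9.0
POOR_FEB = -7.0


@dataclass
class Subgroup:
    total: int           # snapshots in the subgroup
    docked: int = 0      # snapshots already docked
    status: str = "A"    # A active, F finalized, D discarded, P priority changed
    priority: int = 2    # 1 is the most promising, 3 the least


def update_subgroup(group, mean_best_feb):
    """Apply one evaluation step to a subgroup and return it."""
    if group.docked >= group.total:
        group.status = "F"                    # nothing left to dock
    elif mean_best_feb <= GOOD_FEB:
        if group.priority != 1:               # promote to the top priority
            group.priority, group.status = 1, "P"
    elif mean_best_feb <= POOR_FEB:
        if group.priority < 3:                # demote by one level
            group.priority, group.status = group.priority + 1, "P"
    else:
        group.status = "D"                    # stop docking this subgroup
    return group
```

More negative FEB values are treated as better, which matches the way the text speaks of significantly negative FEB as favorable.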
The reason for using P-SaMI in this work is to make full use of its data pattern to eliminate the exhaustive execution of docking simulations of an FFR model without affecting its quality [6, 10] from clusters of snapshots produced by [6, 7] as input files. In this sense, we make use of a web server environment, herein called server controller, to perform the P-SaMI data pattern and a middleware (FReMI) to handle promising snapshots and send them to an HPC environment on the cloud to execute the molecular docking simulations.
2.3. HPC on Amazon EC2 Instances
Cloud computing is a promising new trend for delivering information technology services as computing utilities. Commercial cloud services can play an attractive role in scientific discovery because they provide computing power on demand over the internet, instead of requiring several commodity computers connected by a fast network. Our virtual HPC environment on Amazon EC2 was built using GCC 4.6.2 and MPICH2, based on a master-slave paradigm. It contains 5 High-CPU extra large (c1.xlarge) Amazon EC2 instances, each equipped with 8 cores of 2.5 EC2 compute units each, 7 GB of RAM, and 1,690 GB of local instance storage. One EC2 compute unit is a unit of CPU capacity that corresponds to a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor.
Figure 4 shows the cluster pool created on Amazon EC2 instances, where the same file directory is shared through the network file system (NFS) among the instances to store all input and output files used during the run time of FReMI. In this pool, all data are stored on the Elastic Block Store (EBS) volume of the master machine, and all the instances have permission to read and write in this shared directory, even if a slave instance terminates. However, if the master instance terminates, all data are lost because the master instance EBS volume terminates at the same time. Thus, S3cmd is used to replicate the most important information from Amazon EC2 to an Amazon S3 bucket. S3cmd is an open source command line tool for uploading, retrieving, and managing data in Amazon S3; it is available under the GNU Public License v2, free for commercial and private use, at http://s3tools.org/s3cmd. A bucket is the space used to store data on Amazon S3, and each bucket is identified by a unique bucket name.
The results are aimed at showing the wFReDoW architecture and validating its execution using clusters of snapshots of a specific FFR model against a single ligand. From these results we aim to show that the proposed cloud-based web environment can be more effective than other methods used to automate molecular docking simulations with flexible receptors. In this sense we divided our results into three parts. First, we present the wFReDoW conceptual architecture to give a better understanding of its operation. Next, a set of experiments is examined to discover the best FReMI performance on the Amazon EC2 cloud. Finally, the new RFFR models are presented by means of the wFReDoW execution.
3.1. wFReDoW Conceptual Architecture
This section presents the wFReDoW conceptual architecture (Figure 5) which was developed to speed up the molecular docking simulations for clusters of the FFR model’s conformations. wFReDoW contains two main layers: Server Controller and FReMI. Server Controller is a web workflow based on P-SaMI data pattern that prepares Autodock input files and selects promising snapshots through docked snapshots. FReMI is a middleware based on the many-task computing (MTC)  paradigm that handles high-throughput docking simulations using an HPC environment built on Amazon EC2 instances. In our study, MTC is used to address the problem of executing multiple parallel tasks in multiple processors. Figure 5 details the wFReDoW conceptual architecture with its layers and interactions. The wFReDoW components are distributed in three layers: Client, Server Controller and FReMI.
3.1.1. Client Layer
The Client layer is a web interface used by the scientist to configure the environment. It initializes the wFReDoW execution and analyzes information about the molecular docking simulations. Client is made up of three main components: (i) Setup, which sets up the whole environment before starting the execution; (ii) Execute, which starts the wFReDoW execution; and (iii) Analyze, which shows the provenance of each docking experiment. The communication between Client and Server Controller is done by means of Ajax (http://api.jquery.com/category/ajax/).
3.1.2. Server Controller
Server Controller is a web workflow environment that aids in the reduction of the execution time of molecular docking simulations of FFR models by means of P-SaMI data pattern. It was built using the web framework FLASK 0.8 (http://flask.pocoo.org/) and the Python 2.6.6 libraries. The Server Controller central role is to select promising subgroups of snapshots from an FFR model based on the P-SaMI data pattern . It contains three components: Configuration, Molecular Docking, and P-SaMI. The Configuration component only stores data sent from Setup (Client layer).
The Molecular Docking component manages the P-SaMI input files and performs the predocking steps required for AutoDock4.2 . Firstly, the Prepare Files activity reads the clustering of snapshots generated by  and stores them in the Database. Next, the Prepare Receptor and Prepare Ligand activities generate the PDBQT files used as input files to autogrid4 and autodock4. Finally, the Prepare Grid and Prepare Docking activities create the input files according to the autogrid4 and autodock4 parameters, respectively.
After all files have been prepared by the Molecular Docking component, the P-SaMI component is invoked. This identifies the most promising conformations using the P-SaMI data pattern  from different clusters of snapshots of an FFR model identified by . The P-SaMI component contains three activities: Uploader, Data Analyzer, and Provenance.
Uploader starts the FReMI execution and generates subgroups from snapshot clustering . These subgroups are stored in an XML file structure, called wFReDoW control file (Figure 6). The wFReDoW control file is sent to the Parser/Transfer component (within FReMI) before starting the wFReDoW execution. It contains three root tags described as: experiment, subgroup, and snapshot. The experiment identification (id) is a unique number created for each new docking experiment with an FFR model and one ligand. The subgroup tag specifies the information of the subgroups. The stat and priority tags indicate how promising the snapshots belonging to that subgroup are, according to the rules of the P-SaMI data pattern. The snapshot tag contains information about the snapshots and is used by FReMI to control the docked snapshots.
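As an illustration of such a control file, the sketch below builds an XML document with experiment, subgroup, and snapshot elements. Figure 6 is not reproduced here, so the nesting, the use of an id attribute, and the function name build_control_file are assumptions made for the example rather than the actual wFReDoW layout.

```python
import xml.etree.ElementTree as ET


def build_control_file(experiment_id, subgroups):
    """Build an XML string with experiment, subgroup and snapshot elements.

    subgroups: list of dicts with keys 'id', 'stat', 'priority', 'snapshots'.
    The nesting and attribute names are assumptions for illustration.
    """
    experiment = ET.Element("experiment", id=str(experiment_id))
    for group in subgroups:
        node = ET.SubElement(experiment, "subgroup", id=str(group["id"]))
        ET.SubElement(node, "stat").text = group["stat"]
        ET.SubElement(node, "priority").text = str(group["priority"])
        for name in group["snapshots"]:
            ET.SubElement(node, "snapshot").text = name
    return ET.tostring(experiment, encoding="unicode")


if __name__ == "__main__":
    demo = [{"id": 1, "stat": "A", "priority": 1, "snapshots": ["snap_0001"]}]
    print(build_control_file(1, demo))
```

A consumer such as FReMI could then read the stat and priority of each subgroup to decide which snapshots to queue.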
The Data Analyzer activity examines the docking results, which are sent from FReMI by HTTP Post, based on P-SaMI data pattern. The result of these analyses is a parameter set that is stored in the wFReDoW update files (Figure 7). Thus, to keep FReMI updated with the P-SaMI results, Data Analyzer sends wFReDoW update files to FReMI by SFTP protocol every time P-SaMI modifies the priority and/or status of a subgroup of snapshots.
The Database component is based on FReDD database , built with PostgreSQL 4.2 (http://www.postgresql.org/docs/9.0/interactive/), and is used to provide provenance about data generated by Server Controller. The Provenance activity stores the Server Controller data in the Database component. Hence, the scientist is able to follow wFReDoW execution whenever he/she needs.
3.1.3. FReMI: Flexible Receptor Middleware
FReMI is a middleware on the Amazon Cloud  that handles many tasks to execute, in parallel, the molecular docking simulations of subgroups of conformations of FFR models. It also provides the interoperability between the Server Controller layer and the virtual HPC environment built using the Amazon EC2 instances. FReMI contains five different components: Start, wFReDoW Repository, FReMI workspace, FReMI execution, and HPC environment. Start begins the execution of FReMI and HPC Environment denotes the virtual cluster on EC2 instances. The remaining components are described below.
The wFReDoW Repository contains the Input/Update Files repository. This repository stores all files sent by Server Controller layer using the SFTP network protocol. It consists of predocking files, a wFReDoW control file (Figure 6), and different wFReDoW update files (Figure 7).
The FReMI Workspace component represents the directory structure used to store the huge volume of data manipulated to execute the molecular docking simulations. The input files placed in the wFReDoW Repository are transferred, during FReMI’s execution time, to its workspace by the Parser/Transfer activity within the FReMI Execution set of activities.
The FReMI Execution component—the engine of FReMI—contains every procedure invoked to run the middleware. Its source code was written in the C programming language and its libraries. Figure 8 shows the data flow control followed by the FReMI Execution component. Basically, the FReMI Execution identifies the active snapshots (status A), inserts them in queues of balanced tasks that are created based on subgroup priorities emerging from the P-SaMI data pattern, and submits these queues into the HPC environment. These actions are performed through three activities: Create Queue, Parser/Transfer, and Dispatcher/Monitor.
The Create Queue activity produces a number of queues of balanced tasks during FReMI run time based on the information from the wFReDoW control file (Figure 6). According to the priorities, this activity uses a heuristic function to determine how many processors from the HPC environment will be allocated to each subgroup of snapshots. Furthermore, it uses the status to identify whether a snapshot should be processed or not. For this purpose, the Create Queue activity starts by calculating the maximum number of snapshots that each queue can support. Thus, the number of nodes or machines allocated ($N$) and the number of parallel tasks ($T$) executed per node are used to obtain the queue length ($Q$), with the following equation:
$$Q = N \times T. \quad (1)$$
Afterward, the number of snapshots per subgroup is calculated in order to achieve the balanced distribution of tasks in every queue created. A balanced queue contains one or more snapshots of an active group. From the subgroup priorities, it is possible to determine the percentage of snapshots to be included in the queues. Thus, subgroups with higher priority are queued before those with lower priority. Equation (2) is used to calculate the number of snapshots for a balanced queue:
$$S_g = Q \times \frac{P_g}{\sum_{j} P_j}. \quad (2)$$
$S_g$ is the number of snapshots of the subgroup $g$ that are placed in the queue. $Q$ is the queue length from (1). $P_g$ is the priority of the subgroup $g$, and $\sum_{j} P_j$ is the sum of the priorities of all subgroups. From (2) one queue of balanced tasks ($B$) is created with the following equation:
$$B = \sum_{g} S_g. \quad (3)$$
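The queue sizing described in this passage can be sketched in a few lines. The sketch assumes that the queue length is the number of nodes multiplied by the number of parallel tasks per node, and that each subgroup receives a share of the queue proportional to a priority weight divided by the sum of all weights. The function names, the rounding rule (largest fractional parts get the leftover slots), and the convention that a larger weight means a larger share are assumptions of this example, not details given in the text.

```python
def queue_length(nodes, tasks_per_node):
    """Snapshots one queue can hold: nodes times parallel tasks per node."""
    return nodes * tasks_per_node


def balanced_shares(length, weights):
    """Split `length` queue slots among subgroups in proportion to their weights.

    Each proportional share is floored, then leftover slots go to the
    subgroups with the largest fractional parts, so shares sum to `length`.
    """
    total = sum(weights.values())
    exact = {g: length * w / total for g, w in weights.items()}
    shares = {g: int(x) for g, x in exact.items()}
    leftover = length - sum(shares.values())
    by_fraction = sorted(exact, key=lambda g: exact[g] - shares[g], reverse=True)
    for g in by_fraction[:leftover]:
        shares[g] += 1
    return shares


if __name__ == "__main__":
    q = queue_length(5, 32)
    print(q, balanced_shares(q, {"g1": 3, "g2": 2, "g3": 1}))
```

The sketch does not cap a share at the number of snapshots still waiting in a subgroup; a fuller version would redistribute any unused slots.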
The Parser/Transfer activity handles and organizes the files sent by the Server Controller layer to its workspace on FReMI. It has three functions: (i) to transfer all files received from Server Controller to the FReMI workspace by means of the transfer file function (see Figure 8); (ii) to parse the predocking files in order to recognize FReMI’s file directory structure; and (iii) to update the parameters of the subgroups of snapshots, when necessary, using the get files function. The purpose of this last function is to keep FReMI updated with the Server Controller layer.
The functions of the Dispatcher/Monitor activity, as shown in Figure 8, are invoked to distribute tasks among the processors/cores of the virtual computer cluster on Amazon EC2, based on the master-slave paradigm. Slave Function only runs tasks, while Master Function, aside from running tasks, also performs two other functions: distribute tasks, which is activated when a node/machine asks for more work, and request queue, which is activated when the queue of tasks is empty. Furthermore, to take advantage of the multiprocessing of each virtual machine, we use the hybrid parallel programming model. This model sends bags of tasks among the nodes by means of MPI and shares out the tasks inside every node by OpenMP parallelization.
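The bag-of-tasks idea can be illustrated with a small stand-in. FReMI itself is written in C with MPI and OpenMP; the sketch below swaps in a Python thread pool purely to show the pattern of a master cutting a queue into bags and handing each bag to whichever worker is free. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_bags(tasks, bag_size):
    """Cut the task queue into consecutive bags of at most bag_size tasks."""
    return [tasks[i:i + bag_size] for i in range(0, len(tasks), bag_size)]


def run_bag(bag, run_task):
    """Worker side: process every task in one bag and return the results."""
    return [run_task(task) for task in bag]


def dispatch(tasks, bag_size, workers, run_task):
    """Master side: hand bags to free workers and keep results in task order."""
    bags = split_into_bags(tasks, bag_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda bag: run_bag(bag, run_task), bags)
        return [item for bag_result in results for item in bag_result]


if __name__ == "__main__":
    print(dispatch(list(range(10)), bag_size=4, workers=3, run_task=lambda x: x * x))
```

In the real middleware the bag would travel between nodes over MPI and the loop inside run_bag would be parallelized across the cores of one node.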
3.2. FReMI-Only Execution on Amazon EC2 MPI Cluster
The purpose of executing this set of experiments is to obtain the best MPI/OpenMP performance in the HPC environment on the cloud, that is, the configuration that most reduces the total elapsed time of the molecular docking experiments, so that it becomes the reference for the wFReDoW experiments. For this reason, we processed the TCL ligand (TCL400 from PDB ID: 1P45A) with two rotatable bonds against all 3,100 snapshots that make up the FFR model using FReMI-only execution. The HPC environment was executed on a scale of 1 to 8 EC2 instances. The number of tasks executed per instance, as used in (1), was 32, and the size of the queues of balanced tasks varied according to the number of instances used. The performance of each FReMI-only experiment versus the number of cores used is shown in Figure 9.
The performance gain obtained using the virtual MPI/OpenMP cluster on Amazon EC2 is substantial when compared to the serial version. We observed that the serial version, which was performed using only one core from an EC2 instance, took around 4 days to execute all 3,100 snapshots from the FFR model, and its parallel execution decreased this time by over 92% for the scales of cores examined. Even though the overall time of the parallel executions was reduced considerably, we also evaluated the speedup and efficiency in the virtual HPC environment to take further advantage of every core scaled during the wFReDoW execution.
The FReMI-only execution is unable to take advantage of more than 48 cores because its efficiency then ranges only from 22% to 29% (see Figure 9). Conversely, the cores were well used during the execution when we used fewer than 40. As can be seen, the best FReMI-only execution efficiency (i.e., 42%) was achieved using 32 and 40 cores from the virtual HPC environment. However, the overall execution time was 7 hours and 28 minutes for 32 cores against 5 hours and 47 minutes for 40 cores. As a consequence of these assessments, the best FReMI-only configuration found in this set of experiments was 5 c1.xlarge Amazon EC2 instances with 8 cores each. It is worth mentioning that this configuration reduces the total docking experiment time (i.e., 5 hours and 47 minutes) by about 94% relative to its reference serial execution time, which was 90 hours and 47 minutes.
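The reduction of about 94% can be checked directly from the two reported times. The helper names below are hypothetical; the times are the ones given in the text.

```python
def to_minutes(hours, minutes):
    """Convert an hours-and-minutes duration to minutes."""
    return hours * 60 + minutes


def reduction_percent(reference_min, new_min):
    """Percentage by which new_min is shorter than reference_min."""
    return 100.0 * (reference_min - new_min) / reference_min


if __name__ == "__main__":
    serial = to_minutes(90, 47)    # serial reference run
    parallel = to_minutes(5, 47)   # 5 instances with 8 cores each
    print(round(reduction_percent(serial, parallel), 1))
```

With 5,447 minutes for the serial run and 347 minutes for the 40-core run, the reduction comes to roughly 93.6%, in line with the rounded figure quoted above.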
3.3. wFReDoW Execution on Amazon EC2 MPI Cluster
The main goal of this set of experiments is to show the performance gains in the molecular docking simulations of an FFR model and the new flexible models produced using wFReDoW. The wFReDoW experiments were conducted using 3,100 snapshots from an FFR InhA model, which are clustered by similarity functions, and the TCL ligand (TCL400 from PDB ID: 1P45A) with two rotatable bonds. We used only one FFR model and a single ligand to evaluate wFReDoW because our goal was to analyze the performance gain in the docking experiments of FFR models by investigating the best way to coordinate, in one unique environment, all the computational techniques involved: data mining, data patterns for scientific workflows, cloud computing, parallel programming, the web server, and the FReMI middleware. Each of these technological approaches has particular features and limits that must be dealt with in order to obtain an efficient wFReDoW implementation, avoiding communication faults, overhead, and idleness. Thus, from the best results, we expect that future wFReDoW executions may allow the practical use of fully flexible receptor models in the virtual screening of thousands or millions of compounds from virtual libraries of chemical structures, such as the ZINC database.
According to the P-SaMI data pattern, the analyses start after a percentage of snapshots has been docked. In these experiments we seek to know how many snapshots are discarded and what the quality is of the RFFR models produced for each clustering when the P-SaMI data pattern starts to evaluate after 30%, 40%, 50%, 70%, and 100% of the snapshots have been docked. When 100% of the snapshots are docked, P-SaMI does not analyze the docking results. Thus, we perform fifteen different kinds of docking experiments: one for each combination of a clustering of snapshots and a starting percentage. In this sense, Server Controller prepared three different wFReDoW control files, one for each clustering of snapshots generated by , and four different P-SaMI configurations that start the analysis at the above-mentioned percentages below 100%.
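The experimental design described here can be enumerated as a simple grid. Combining the three clusterings with the five starting percentages listed in the text gives 15 combinations; the labels 01, 02, and 03 follow the naming used for Figure 10, and the function name is hypothetical.

```python
from itertools import product

CLUSTERINGS = ["01", "02", "03"]
START_PERCENTAGES = [30, 40, 50, 70, 100]  # at 100 P-SaMI performs no analysis


def experiment_grid(clusterings, percentages):
    """Return every (clustering, starting percentage) pair."""
    return list(product(clusterings, percentages))


if __name__ == "__main__":
    grid = experiment_grid(CLUSTERINGS, START_PERCENTAGES)
    print(len(grid), grid[:3])
```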
Figure 10 summarizes the total execution time and the number of snapshots docked and discarded for each wFReDoW experiment. In this figure, each graph represents the wFReDoW results obtained by running a P-SaMI configuration for each clustering of snapshots, labeled 01, 02, and 03. Every clustering contains the 3,100 snapshots from the FFR model, grouped into 4 to 6 clusters depending on the similarity function used by . The total execution time for each experiment (one clustering for one P-SaMI configuration) is measured from the moment the preparation of the wFReDoW control file (in the Server Controller) begins until the last docking result arrives at the Server Controller.
In this paper we presented the roles of wFReDoW, a cloud-based web environment for faster execution of molecular docking simulations of FFR models, and, through its execution, we showed the RFFR models produced. As can be observed in Figure 10, wFReDoW not only creates new RFFR models but also speeds up the docking experiments in all cases, owing to the reduction in the number of docking experiments provided by the P-SaMI data pattern and the simultaneous docking execution performed by the virtual HPC environment. Although we used a small FFR model and only a single ligand, the results indicate that wFReDoW is a promising tool for performing molecular docking simulations of new FFR models, even using large libraries of chemical structures for virtual screening.
4.1. wFReDoW Performance
According to , the earlier the analysis starts (in this case at 30%), the larger the number of unpromising snapshots that can be recognized and discarded. Figure 10 supports this statement. The wFReDoW results show that when the P-SaMI data pattern starts the analyses of the FFR model with 30% of the snapshots docked, the number of unpromising snapshots discarded is higher. Additionally, as this percentage increases, the number of unpromising snapshots that are nevertheless docked increases as well. Consequently, if the number of docked snapshots decreases, the overall execution time also decreases. Thus, considering the best run time of wFReDoW, that is, 3 hours and 54 minutes (Figure 10), the use of P-SaMI achieved a reduction of about 30% relative to the FReMI-only overall execution (5 hours and 47 minutes).
Another consideration for wFReDoW performance is that the FReMI middleware also runs on local cluster infrastructure. However, the efficiency is not the same. We also executed FReMI only, using a sample of snapshots from the FFR InhA model, on the Atlantica cluster in order to compare the performance gains obtained with the virtual and the local cluster infrastructures. (The Atlantica cluster consists of 10 nodes connected by a fast network system. Each node contains two Intel Xeon Quad-Core E5520 2.27 GHz CPUs with Hyper-Threading and 16 GB of RAM, aggregating 16 cores per node. The cluster is connected by two gigabit Ethernet networks, one for communication between nodes and another for management. The Atlantica cluster supplies high performance computational resources for the academic community.) We made several investigations with different numbers of nodes and cores, and with different numbers of tasks executed per node. In the end we found that, in most cases, Amazon EC2 outperforms the Atlantica cluster. For instance, using the same number of cores as on Amazon EC2, that is, 5 nodes with 8 cores each, for a sample of 126 snapshots from the FFR model and 16 tasks executed per instance (the per-node task count in (1)), the total execution time was 14.94 minutes for the Atlantica cluster and 8.78 minutes for Amazon EC2. Possibly, this performance difference arises because we used the Atlantica cluster in a nonexclusive mode, sharing the cluster’s facilities. From this evidence and our previous studies, we concluded that the EC2 configuration presents itself as a very attractive HPC solution for executing molecular docking simulations of a larger set of snapshots and for different ligands.
4.2. The Quality of the RFFR Models Produced
We showed that the approach used in this study enhances the performance of the molecular docking simulations of FFR models in most cases. However, to make sure that the P-SaMI data pattern selected the best snapshots from the cluster of snapshots used, we verified the quality of the RFFR models built by wFReDoW. Regarding this, we took only the first run of the 25 runs performed by AutoDock 4.2, which contains the best FEB of each docking, to evaluate the produced models. The best docking result of each snapshot was organized according to the percentage of snapshots with the best FEB values in an ascending order (set of best FEB). Then, we investigated if the selected snapshots belonged to the percentage of this set. As a result we obtained the data described in Table 1 with the number of docked snapshots for each set of best FEB and its respective accuracy.
Based on the data in Table 1, we can observe that wFReDoW worked well for all P-SaMI analyses. This is evidenced by the computed accuracy of the produced RFFR models, which contain more than 93% of their snapshots within the set of best FEB values. In clustering 02, for instance, wFReDoW worked best when P-SaMI started the analysis at 70%, selecting 308 of the 310 best snapshots, 612 of the 620 best, and 913 of the 930 best. When P-SaMI started the analysis at 30% in the same clustering, wFReDoW selected 302 of the 10% best, 593 of the 20% best, and 871 of the 30% best. Even though wFReDoW selected fewer snapshots in the latter P-SaMI configuration, these represent 97.42%, 95.65%, and 93.66% of the 10%, 20%, and 30% best FEB, respectively. The difference between the best and worst wFReDoW selections is slight. However, the difference of about 1 hour in the total wFReDoW execution time (3 hours and 54 minutes for P-SaMI analysis from 30% against 4 hours and 57 minutes for P-SaMI analysis from 70%) is a good motivation to start the P-SaMI analyses when only 30% of the snapshots have been docked. This is a promising way to reduce the overall execution time while preserving the quality of the models produced.
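The accuracy figures for clustering 02 follow from dividing the number of selected snapshots by the size of each best-FEB set. The sketch below reproduces that arithmetic from the counts quoted in the text; the variable and function names are illustrative assumptions.

```python
def accuracy(selected: int, best_total: int) -> float:
    """Percentage of the best-FEB set that was actually selected."""
    return round(100.0 * selected / best_total, 2)


# (snapshots selected, size of the best-FEB set) for clustering 02.
start_at_30 = {"10%": (302, 310), "20%": (593, 620), "30%": (871, 930)}
start_at_70 = {"10%": (308, 310), "20%": (612, 620), "30%": (913, 930)}

acc_30 = {band: accuracy(*counts) for band, counts in start_at_30.items()}
acc_70 = {band: accuracy(*counts) for band, counts in start_at_70.items()}

for band in start_at_30:
    print(f"best {band}: start at 30% -> {acc_30[band]}%, "
          f"start at 70% -> {acc_70[band]}%")
```

The gap between the two configurations stays within a few percentage points in every band, which is the basis for calling the difference slight.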
It is worth mentioning that wFReDoW is only capable of building an RFFR model, without losing the quality of its original model, if the clustering methods used as input data contain high affinity among the produced clusters of snapshots from . This means that wFReDoW, with its features, is always able to improve the performance. However, for improving the quality of the RFFR models produced, the used clustering also needs to be of a high quality.
4.3. Amazon Cloud
The most significant advantage of these shared resources is guaranteed access to them wherever you are and whenever you need them. There is no competition or restriction on access to the machines. However, it is necessary to pay for as many computing nodes as needed, which are charged at an hourly rate. The rate is calculated according to which resources are used and when; for example, if you do not need computing time, you do not pay.
The main contribution of our article is wFReDoW, a cloud-based web environment that handles molecular docking simulations of FFR models faster by using more than one computational approach cooperatively. wFReDoW includes the P-SaMI data pattern to select promising snapshots and the FReMI middleware, which uses an HPC environment on Amazon EC2 instances to reduce the total elapsed time of docking experiments. The results showed that the best FReMI-only performance decreased the overall execution time by about 94% relative to the corresponding serial execution. Furthermore, wFReDoW reduced the total execution time by a further 10–30% relative to the best FReMI-only execution without affecting the quality of the produced RFFR models.
There are several possible ways to further improve the efficiency of wFReDoW. One of the biggest limitations on wFReDoW’s performance is that the Server Controller layer runs on a web server located outside Amazon EC2. Even though we placed all docking input files in the wFReDoW repository (inside the FReMI layer) in advance, a large number of files are still transferred during the wFReDoW execution. In this experiment, the time taken to transfer these files was negligible since our FFR model holds only 3,100 snapshots. However, for FFR models with hundreds of thousands of snapshots, the transfer time will increase significantly. One way to enhance the overall performance is to host the Server Controller layer on an EC2 instance. This would greatly reduce the time taken to transfer the files from the Server Controller to FReMI. Furthermore, the Server Controller layer could send only the docking input files of promising snapshots during the wFReDoW execution, reducing both the number of files transferred and the overall elapsed time.
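The last suggestion, transferring only the input files of promising snapshots, amounts to a simple filter on the file list before transfer. The following is a hedged sketch and not the wFReDoW implementation: the file naming scheme (`snap_0001.pdbqt`), the function name, and the way promising snapshot identifiers are supplied are all assumptions made for illustration.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def select_promising_inputs(filenames: Iterable[str],
                            promising_ids: Iterable[str]) -> List[str]:
    """Keep only input files whose stem matches a promising snapshot id.

    Assumes (hypothetically) one input file per snapshot, named after
    the snapshot identifier.
    """
    promising = set(promising_ids)
    return [name for name in filenames
            if PurePosixPath(name).stem in promising]


# Hypothetical example: six snapshot files, two flagged as promising.
all_inputs = [f"snap_{i:04d}.pdbqt" for i in range(1, 7)]
promising = ["snap_0002", "snap_0005"]

to_send = select_promising_inputs(all_inputs, promising)
print(f"Transferring {len(to_send)} of {len(all_inputs)} files: {to_send}")
```

Using a set for the promising identifiers keeps the lookup cost constant per file, which matters once the model grows to hundreds of thousands of snapshots.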
wFReDoW was tested with a single ligand and an FFR model containing only 3,100 conformations of InhA generated by an MD simulation. MD simulations of the same model now run for tens to hundreds of nanoseconds, which could produce FFR models with more than 200,000 snapshots. wFReDoW should be tested with such models. Additionally, it would be interesting to test other ligands drawn from public databases of small molecules, such as ZINC.
Conflict of Interests
The authors declare no conflict of interest.
The authors thank the reviewers for their comments and suggestions. This work was supported in part by grants (305984/2012-8 and 559917/2010-4) from the Brazilian National Research and Development Council (CNPq) to Osmar Norberto de Souza and from EU Project CILMI to Duncan D. A. Ruiz. Osmar Norberto de Souza is a CNPq Research Fellow. Renata De Paris was supported by a CNPq M.S. scholarship. FAF was supported by HP-PROFACC M.S. scholarship.
- N. M. Luscombe, D. Greenbaum, and M. Gerstein, “What is bioinformatics? A proposed definition and overview of the field,” Methods of Information in Medicine, vol. 40, no. 4, pp. 346–358, 2001.
- I. D. Kuntz, “Structure-based strategies for drug design and discovery,” Science, vol. 257, no. 5073, pp. 1078–1082, 1992.
- I. M. Kapetanovic, “Computer-aided drug discovery and development (CADDD): in silico-chemico-biological approach,” Chemico-Biological Interactions, vol. 171, no. 2, pp. 165–176, 2008.
- B. Q. Wei, L. H. Weaver, A. M. Ferrari, B. W. Matthews, and B. K. Shoichet, “Testing a flexible-receptor docking algorithm in a model binding site,” Journal of Molecular Biology, vol. 337, no. 5, pp. 1161–1182, 2004.
- G. M. Morris, R. Huey, W. Lindstrom et al., “Software news and updates AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility,” Journal of Computational Chemistry, vol. 30, no. 16, pp. 2785–2791, 2009.
- K. S. Machado, A. T. Winck, D. D. A. Ruiz, and O. Norberto de Souza, “Mining flexible-receptor docking experiments to select promising protein receptor snapshots,” BMC Genomics, vol. 11, supplement 5, article S6, 2010.
- K. S. Machado, A. T. Winck, D. D. A. Ruiz, and O. Norberto de Souza, “Mining flexible-receptor molecular docking data,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 6, pp. 532–541, 2011.
- J. H. Lin, A. L. Perryman, J. R. Schames, and J. A. McCammon, “The relaxed complex method: accommodating receptor flexibility for drug design with an improved scoring scheme,” Biopolymers, vol. 68, no. 1, pp. 47–62, 2003.
- H. Alonso, A. A. Bliznyuk, and J. E. Gready, “Combining docking and molecular dynamic simulations in drug design,” Medicinal Research Reviews, vol. 26, pp. 531–568, 2006.
- P. Hübler, P-SaMI: self-adapting multiple instances - a data pattern to scientific workflows (in Portuguese: P-SaMI: padrão de múltiplas instâncias autoadaptáveis - um padrão de dados para workflows científicos) [Ph.D. thesis], PPGCC-PUCRS, Porto Alegre, Brasil, 2010.
- R. De Paris, F. A. Frantz, O. Norberto de Souza, and D. D. A. Ruiz, “A conceptual many tasks computing architecture to execute molecular docking simulations of a fully-flexible receptor model,” Advances in Bioinformatics and Computational Biology, vol. 6832, pp. 75–78, 2011.
- X. Jiang, K. Kumar, X. Hu, A. Wallqvist, and J. Reifman, “DOVIS 2.0: an efficient and easy to use parallel virtual screening tool based on AutoDock 4.0,” Chemistry Central Journal, vol. 2, article 18, 2008.
- N. D. Prakhov, A. L. Chernorudskiy, and M. R. Gainullin, “VSDocker: a tool for parallel high-throughput virtual screening using AutoDock on Windows-based computer clusters,” Bioinformatics, vol. 26, no. 10, pp. 1374–1375, 2010.
- R. M. V. Abreu, H. J. C. Froufe, M. J. R. P. Queiroz, and I. C. F. R. Ferreira, “MOLA: a bootable, self-configuring system for virtual screening using AutoDock4/Vina on computer clusters,” Journal of Cheminformatics, vol. 2, no. 1, article 10, 2010.
- B. Collignon, R. Schulz, J. C. Smith, and J. Baudry, “Task-parallel message passing interface implementation of Autodock4 for docking of very large databases of compounds using high-performance super-computers,” Journal of Computational Chemistry, vol. 32, no. 6, pp. 1202–1209, 2011.
- A. P. Norgan, P. K. Coffman, J. A. Kocher, K. J. Katzman, and C. P. Sosa, “Multilevel parallelization of AutoDock 4.2,” Journal of Cheminformatics, vol. 3, pp. 1–7, 2011.
- S. R. Ellingson and J. Baudry, “High-throughput virtual molecular docking with AutoDockCloud,” Concurrency and Computation: Practice and Experience, 2012.
- “Amazon elastic compute cloud,” http://aws.amazon.com/ec2/.
- J. J. Irwin and B. K. Shoichet, “ZINC - a free database of commercially available compounds for virtual screening,” Journal of Chemical Information and Modeling, vol. 45, no. 1, pp. 177–182, 2005.
- M. R. Kuo, H. R. Morbidoni, D. Alland et al., “Targeting tuberculosis and malaria through inhibition of enoyl reductase: compound activity and structural data,” Journal of Biological Chemistry, vol. 278, no. 23, pp. 20851–20859, 2003.
- E. K. Schroeder, L. A. Basso, D. S. Santos, and O. N. De Souza, “Molecular dynamics simulation studies of the wild-type, I21V, and I16T mutants of isoniazid-resistant Mycobacterium tuberculosis enoyl reductase (InhA) in complex with NADH: toward the understanding of NADH-InhA different affinities,” Biophysical Journal, vol. 89, no. 2, pp. 876–884, 2005.
- R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud computing and emerging IT platforms: vision, hype, and reality for delivering computing as the 5th utility,” Future Generation Computer Systems, vol. 25, no. 6, pp. 599–616, 2009.
- C. Banino, O. Beaumont, L. Carter, J. Ferrante, A. Legrand, and Y. Robert, “Scheduling strategies for master-slave tasking on heterogeneous processor platforms,” IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 4, pp. 319–330, 2004.
- K. S. Machado, E. K. Schroeder, D. D. Ruiz, E. M. L. Cohen, and O. Norberto de Souza, “FReDoWS: a method to automate molecular dockings simulations with explicit receptor flexibility and snapshots selection,” BMC Genomics, vol. 12, pp. 2–13, 2011.
- I. Raicu, I. Foster, M. Wilde et al., “Middleware support for many-task computing,” Cluster Computing, vol. 13, pp. 291–314, 2010.
- A. T. Winck, K. S. Machado, O. Norberto de Souza, and D. D. Ruiz, “FReDD: supporting mining strategies through a flexible-receptor docking database,” Advances in Bioinformatics and Computational Biology, vol. 5676, pp. 143–146, 2009.
- R. Rabenseifner, G. Hager, and G. Jost, “Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes,” in Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP '09), pp. 427–436, IEEE Press, Weimar, Germany, February 2009.
Copyright © 2013 Renata De Paris et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.