I am a computer scientist working on machine learning methods for predicting missing Gene Ontology annotations. In many papers, I have written that my computational methods are very useful for suggesting to biologists which gene functions they should direct their research toward, because my software needs fewer resources (a simple laptop connected to a database) and less time (e.g. about 3 hours for a dataset of all the Homo sapiens Gene Ontology annotations) than their biological experiments.
Now a reviewer has asked me to be more precise. In truth, how much does an in vitro experiment run by biologists to biologically validate a human Gene Ontology annotation cost?
How much money, broadly?
How much time?
Wow, that can vary from a couple of months and a few thousand dollars to a lifetime and millions of dollars. I can give a couple of examples, but this is an extremely broad question. Wet science is not cheap might be the message here.
Gene Ontology terms are meant to cover all biological roles known for genes. Terms are added and retired regularly. They embody thousands of person-years of work to establish a framework of what is going on in cells. At the same time, it's a limited vocabulary of gene behavior. It doesn't represent a complete catalog of all known gene behaviors (e.g. in such detail that a behavior could only be the property of a specific gene), and it is also uneven in the level of detail of its leaf terms. It's an ontology that is manually assembled and edited, and therefore subjective.
I just did an overview of GO 1.2: there are 34,153 unique GO terms. That's easily tens of thousands of separate assays, so picking a group of important ones might be useful. There are also lots of orphan GO terms, where only one gene is currently attached to the term.
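If you want to reproduce that count, GO releases are distributed as OBO flat files, in which each term is a `[Term]` stanza with an `id:` line, and retired terms carry an `is_obsolete: true` flag. Here is a rough sketch of counting the non-obsolete terms; the field names are the standard OBO 1.2 ones, but the parsing is deliberately simplistic and ignores stanza features a real file may contain:

```python
def count_go_terms(obo_text):
    """Count unique GO ids in [Term] stanzas, skipping obsolete terms."""
    terms = set()
    in_term, current_id, obsolete = False, None, False

    def flush():
        if in_term and current_id and not obsolete:
            terms.add(current_id)

    for line in obo_text.splitlines():
        line = line.strip()
        if line == "[Term]":                       # new term stanza begins
            flush()
            in_term, current_id, obsolete = True, None, False
        elif line.startswith("[") and line.endswith("]"):
            flush()                                # some other stanza type
            in_term = False
        elif in_term and line.startswith("id: "):
            current_id = line[4:]
        elif in_term and line == "is_obsolete: true":
            obsolete = True
    flush()
    return len(terms)

# Tiny inline example: one live term, one obsolete term, one typedef.
SAMPLE = """[Term]
id: GO:0000001
name: mitochondrion inheritance

[Term]
id: GO:0000002
is_obsolete: true

[Typedef]
id: part_of
"""
print(count_go_terms(SAMPLE))  # prints 1: the obsolete term is skipped
```

Running this over a full `go-basic.obo` release (filename is an assumption) gives the term count for that release.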
For GO annotations, a simple and typical case is a metabolic gene with high homology to a known gene, where you are only trying to verify that biological role and you already have a functioning biological laboratory.
A breakdown might be:
- Clone the gene ($200 to $1,000 and up): order primers, use PCR to insert the gene into a plasmid (or get the gene synthesized if you have to), transform it into bacteria, and ideally sequence it to make sure you have it right.
- Purify the protein ($500, maybe): run a check culture to see if the protein is produced, then run a full culture with induction, lyse it, and purify on an affinity chromatography column. You may need to run a second column to get the protein pure.
- Do a kinetics study with mass spectrometric validation of the products (minimally about $300). The cost involves reagents: if you have an assay that can use a UV/Vis spectrophotometer, this can be simple; if not, it can be much more expensive. This part will vary with the enzyme you are looking at.
Hopefully you get all these steps right the first time. Salaries are not included in this cost.
The more complicated the GO term, the more difficult the work will be; they are not all equal. If the protein function can't be verified in isolation but must be tested in situ (i.e. in the living organism), then you start to get into time. What if your gene is involved in protein routing through the Golgi apparatus and modulates the function of other proteins, for instance? That's going to require a very specialized and difficult set of experiments that should involve someone with experience in that field.
Finding GO annotations which are not clear from sequence similarity (unknown functions, or new genes reflecting previously unestablished GO terms) could cost even more and take a good deal longer.
Still, high-throughput studies are being done to find and validate electronic GO annotations, and at that scale the cost per experiment will go down. Take a look at Olga Troyanskaya's research for an example of such an effort.
'Validation' in genome-scale research
The individual 'validation' experiments typically included in papers reporting genome-scale studies often do not reflect the overall merits of the work.
Following the advent of genome sequencing, the past decade has seen an explosion in genome-scale research projects. Major goals of this type of work include gaining an overview of how biological systems work, generation of useful reagents and reference datasets, and demonstration of the efficacy of new techniques. The typical structure of these studies, and of the resulting manuscripts, is similar to that of a traditional genetic screen. The major steps often include development of reagents and/or an assay, systematic implementation of the assay, and analysis and interpretation of the resulting data. The analyses are usually centered on identifying patterns or groups in the data, which can lead to predictions regarding previously unknown or unanticipated properties of individual genes or proteins.
So that the work is not purely descriptive – anathema in the molecular biology literature – there is frequently some follow-up or 'validation', for example, application of independent assays to confirm the initial data, an illustration of how the results obtained apply to some specific cellular process, or the testing of some predicted gene functions. As the first few display items are often schematics, example data, clustering diagrams, networks, tables of P-values and the like, these validation experiments usually appear circa Figure 5 or 6 in a longer-format paper. This format is sufficiently predominant that my colleague Charlie Boone refers to it as "applying the formula". I have successfully used the formula myself for many papers.
My motivation for writing this opinion piece is that, in my own experience, as both an author and a reviewer, the focal point of the review process – and of the editorial decision – seems too often to rest on the quality of the validation, which is usually not what the papers are really about. While it is customary for authors to complain about the review process in general (and for reviewers to complain about the papers they review), as a reader of such papers and a user of the datasets, I do think there are several legitimate reasons why our preoccupation with validation in genomic studies deserves reconsideration.
First, single-gene experiments are a poor demonstration that a large-scale assay is accurate. To show that an assay is consistent with previous results requires testing a sufficiently large collection of gold-standard examples to be able to assess standard measures such as sensitivity, false-positive rate and false-discovery rate. A decade ago, there were many fewer tools and resources available; for example, Gene Ontology (GO) did not exist before the year 2000, and many of the data analysis techniques now in common use were unfamiliar to most biologists. Proving that one could make accurate predictions actually required doing the laboratory analyses. But today, many tools are in place to make the same arguments by cross-validation, which produces all of the standard statistics. It is also (gradually) becoming less fashionable for molecular biologists to be statistical Luddites.
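Once gold-standard positives and negatives are in hand, the standard measures mentioned above reduce to a few lines of arithmetic on the confusion-matrix counts. A minimal sketch (the counts themselves are hypothetical):

```python
# Standard evaluation measures computed from the four confusion-matrix
# counts of a gold-standard test: tp/fp/tn/fn.
def evaluation_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)           # true-positive rate (recall)
    false_positive_rate = fp / (fp + tn)
    false_discovery_rate = fp / (tp + fp)  # 1 - precision
    return sensitivity, false_positive_rate, false_discovery_rate

# Hypothetical counts for illustration: 80 true positives, 20 missed,
# 10 false calls among 990 true negatives.
sens, fpr, fdr = evaluation_metrics(tp=80, fp=10, tn=990, fn=20)
print(f"sensitivity={sens:.2f}, FPR={fpr:.3f}, FDR={fdr:.3f}")
# prints sensitivity=0.80, FPR=0.010, FDR=0.111
```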
Second, and similarly, single-gene experiments, or illustrations relating to a specific process, do not describe the general utility of a dataset. Many studies have shown (even if they did not emphasize) that specific data types and reagents are more valuable for the study of some things than others. Validation experiments tend to focus on the low-hanging fruit, for instance, functional categories that seem to be yielding the best examples and the largest numbers. To minimize the ire of my colleagues, I will give an example from my own work. Our first efforts at systematically predicting yeast gene functions from gene-expression data resulted in more predictions relating to RNA processing than to any other category, and Northern blots are something even my lab can do, so these were the ones we tested. Although we would like to think that the success rate at validating predictions from other processes will be as high as our cross-validation predicted, laboratory validation of predictions from only one category does not show that. Moreover, if one is engaged in high-throughput data collection, it is possible to perform a large number of validations and show only those that work. It is also possible to choose the validation experiments from other screens already in progress, or already done, or even from other labs. I suspect this practice may be widespread.
A third issue is that focus on the validation is often at the expense of a thorough evaluation of the key points of the remainder of the paper. I may be further ruffling the fur of my colleagues here, but I think it is fair to say that a hallmark of the functional genomics/systems biology/network analysis literature is an emphasis on artwork and P-values, and perhaps not enough consideration of questions such as the positive predictive value of the large-scale data. David Botstein has described certain findings as "significant, but not important" – if one is making millions of measurements, an astronomically significant statistical relationship can be obtained between two variables that barely correlate, and an overlap of only one or a few percent in a Venn diagram can be very significant by the widely used hypergeometric test. A good yarn seems to distract us from a thorough assessment of whether statistical significance equates to biological significance, and even whether the main dataset actually contains everything that is claimed.
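To see how little overlap is needed for "astronomical" significance, the one-sided hypergeometric test can be computed directly from binomial coefficients. The gene-list and universe sizes below are hypothetical:

```python
import math

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for the overlap X between a set of K genes and a set of
    n genes, both drawn from a universe of N genes (one-sided test)."""
    total = math.comb(N, n)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Hypothetical example: two lists of 500 genes each from a universe of
# 20,000; the expected overlap by chance is 500*500/20000 = 12.5 genes.
p = hypergeom_sf(40, 20000, 500, 500)
print(f"overlap of 40/500 (8%): p = {p:.2e}")
```

An observed overlap of only 8% of each list is vastly significant here, which is exactly the "significant, but not important" scenario: statistical significance alone says little about biological significance.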
I'm writing for an issue of Journal of Biology that is about how to make the peer review process easier, but I do believe that papers in our field would be better if referees were allowed and expected (and given time) to look at the primary data, have a copy of the software, use the same annotation indices, and so on, and see whether they can verify the claims and be confident in conclusions that are reached from computational analyses. Even simple reality checks such as comparing replicates (when there are some) are often ignored by both authors and reviewers. I bring this up because one of the major frustrations expressed by a group of around 30 participants at the Computational and Statistical Genomics workshop I attended at the Banff International Research Station last June was the difficulty of reproducing computational analyses in the functional genomics literature. Often, the trail from the primary data to the published dataset is untraceable, let alone the downstream analyses.
Fourth, and finally, the individual validation experiments may not garner much attention unless they are mentioned in the title or have appropriate keywords in the abstract. They are rarely as useful as they would be in a paper in which they were explored in more depth and in which the individual hypothesis-driven experiments could be summarized. For instance, a paper we published in Journal of Biology in 2004 described an atlas of gene expression in 55 mouse tissues and cell types. Using SVM (Support Vector Machine) cross-validation scores, we found that, for many GO annotation categories, it was possible to predict which genes were in the category to a degree that is orders of magnitude better than random guessing, although usually still far from perfect. The most interesting aspect of the study to me was the observation that there is a quantitative relationship between gene expression and gene function – not that this was completely unexpected, but it is nice to have experimental evidence to support the generality of one's assumptions. The SVM scores were used mainly to prove the general point, and whether any individual predictions were correct was not the key finding – we knew ahead of time (from the cross-validation results) that most of the individual predictions would not be correct; this is the nature of the business. Nonetheless, final acceptance of the manuscript hinged on our being able to show that the predictions were accurate, so at the request of reviewers and editors, we showed that Pwp1 is involved in rRNA biogenesis, as predicted. According to Google Scholar, this paper now has 139 citations, and my perusal of all of them suggests that neither Pwp1 nor ribosome biogenesis is the topic of any of the citing papers. The vast majority of citations are bioinformatics analyses, reviews, and other genomics and proteomics papers, many of them concerning tissue-specific gene expression.
Thus, the initial impact appears primarily to have been the proof-of-principle demonstration of the relationship between gene function and gene expression across organs and cell types, and the microarray data themselves. It is the use of genome-scale data and cross-validation that proves the point, not the individual follow-up experiments.
A small survey of my colleagues suggests that many such examples would be found in a more extensive analysis of the literature in functional genomics and systems biology.
For instance, Jason Moffat explained that in the reviews of his 2006 Cell paper describing the RNAi Consortium lentivirus collection , which already contained a screen for alteration of the mitotic index in cultured cells, a major objection was that more work was needed to validate the reagents by demonstrating that the screen would also work in primary cell cultures – which may be true, but so far, even the mitotic index screen seems to have served primarily as an example of what one can do with the collection. The paper has clearly had a major impact: it has 161 citations according to Google Scholar, the vast majority of which relate to use of the RNAi reagents, not any of the individual findings in this paper.
To conclude, I would propose that, as authors, reviewers and editors, we should re-evaluate our notion of what parts of genome-scale studies really are interesting to a general audience, and consider carefully which parts of papers prove the points that are being made. It is, of course, important that papers are interesting to read, have some level of independent validation, and a clear connection to biology. But it seems likely that pioneering reagent and data collections, technological advances, and studies proving or refuting common perceptions will continue to be influential and of general interest, judging by citation rates. As erroneous data or poorly founded conclusions could have a proportionally detrimental influence, we should be making an effort to scrutinize more deeply what is really in the primary data, rather than waiting to work with it once it is published. Conversely, the individual 'validation' studies that occupy the nethermost figures, although contributing some human interest, may be a poor investment of resources, making papers unnecessarily long, delaying the entry of valuable reagents and datasets into the public domain, and possibly distracting from the main message of the manuscript.
This paper presents an application of Fuzzy Clustering of Large Applications based on Randomized Search (FCLARANS) for attribute clustering and dimensionality reduction in gene expression data. Domain knowledge based on gene ontology and differential gene expression is employed in the process. The use of domain knowledge helps in the automated selection of biologically meaningful partitions. Gene ontology (GO) analysis helps in detecting biologically enriched and statistically significant clusters. Fold-change is measured to select the differentially expressed genes as the representatives of these clusters. Tools like the Eisen plot and cluster profiles help establish the coherence of these clusters. Important representative features (or genes) are extracted from each enriched gene partition to form the reduced gene space. While the reduced gene set forms a biologically meaningful attribute space, it simultaneously leads to a decrease in computational burden. External validation of the reduced subspace, using various well-known classifiers, establishes the effectiveness of the proposed methodology on four sets of publicly available microarray gene expression data.
Many biological processes are modeled using ordinary differential equations (ODEs) that describe the evolution over time of certain quantities of interest. At the molecular level, the variables considered in the models often represent concentrations (or number of molecules) of chemical species, such as proteins and mRNA. Once the pathway structure is known, the corresponding equations are relatively easy to write down using widely accepted kinetic laws, such as the law of mass action or the Michaelis-Menten law.
In general the equations will depend on several parameters. Some of them, such as reaction rates, and production and decay coefficients have a physical meaning. Others might come from approximations or reductions that are justified by the structure of the system and, therefore, they might have no direct biological or biochemical interpretation. In both cases, most of the parameters are unknown. While sometimes it is feasible to measure them experimentally (especially those in the first class), in many cases this is very hard, expensive, time consuming, or even impossible. However, it is usually possible to measure some of the other variables involved in the models (such as abundance of chemical species) using PCR, immunoblotting assays, fluorescent markers, and the like.
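As a minimal illustration of how such kinetic laws turn into parameter-dependent ODEs, here is a sketch of a one-substrate Michaelis-Menten model, dS/dt = -Vmax*S/(Km + S), integrated with forward Euler. Vmax and Km play the role of the unknown parameters discussed in the text; the values used here are purely illustrative:

```python
# Forward-Euler integration of the Michaelis-Menten rate law
#   dS/dt = -Vmax * S / (Km + S)
# Vmax, Km and the initial substrate concentration are illustrative.

def simulate_mm(s0, vmax, km, dt=0.01, t_end=10.0):
    """Return the substrate time course [S(0), S(dt), S(2*dt), ...]."""
    s, traj = s0, [s0]
    for _ in range(int(t_end / dt)):
        s += dt * (-vmax * s / (km + s))   # one Euler step
        traj.append(s)
    return traj

traj = simulate_mm(s0=5.0, vmax=1.0, km=0.5)
print(f"S(0)={traj[0]:.2f}, S(10)={traj[-1]:.4f}")
```

The substrate decays roughly linearly while S >> Km (rate saturated at Vmax) and roughly exponentially once S << Km, which is the qualitative behavior the rate law encodes.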
For these reasons, the problem of parameter estimation, that is, the indirect determination of the unknown parameters from measurements of other quantities, is a key issue in computational and systems biology. Knowledge of the parameter values is crucial whenever one wants to obtain quantitative, or even qualitative, information from the models.
In the last fifteen years a lot of attention has been given to this problem in the systems biology community. Much research has been conducted on the application to computational biology models of several optimization techniques, such as linear and nonlinear least-squares fitting, simulated annealing, genetic algorithms, and evolutionary computation. The latter is suggested as the method of choice for large parameter estimation problems. Starting with a suitable initial guess, optimization methods search more or less exhaustively the parameter space in an attempt to minimize a certain cost function. This is usually defined as the error, in some sense, between the output of the model and the data that come from the experiments. The result is the set of parameters that produces the best fit between simulations and experimental data. One of the main problems associated with optimization methods is that they tend to be computationally expensive and may not perform well if the noise in the measurements is significant.
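The cost-function idea can be made concrete with a toy model: fitting the decay rate k of x(t) = x0*exp(-k*t) to synthetic noisy data by exhaustive grid search. The grid search is an intentionally crude stand-in for the optimizers cited above, and all numbers are illustrative:

```python
import math, random

# Synthetic "experimental" data from a one-parameter decay model.
random.seed(0)
TRUE_K, X0 = 0.7, 10.0
times = [0.5 * i for i in range(10)]
data = [X0 * math.exp(-TRUE_K * t) + random.gauss(0, 0.1) for t in times]

def cost(k):
    """Sum of squared errors between model output and the data."""
    return sum((X0 * math.exp(-k * t) - d) ** 2 for t, d in zip(times, data))

# Crude exhaustive search over k in (0, 2]; real methods (simulated
# annealing, genetic algorithms, ...) explore the space more cleverly.
candidates = [0.01 * i for i in range(1, 201)]
best_k = min(candidates, key=cost)
print(f"estimated k = {best_k:.2f} (true value {TRUE_K})")
```

Even this toy example shows why the approach becomes expensive: the number of candidate points grows exponentially with the number of parameters, and measurement noise flattens the cost landscape around the minimum.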
Considerable interest has also been raised by Bayesian methods, which can extract information from noisy or uncertain data. This includes both measurement noise and intrinsic noise, which is well known to play an important role in chemical kinetics when species are present in low copy numbers. The main advantage of these methods is their ability to infer the whole probability distribution of the parameters, rather than just a point estimate. Also, they can handle estimation for stochastic systems with no substantial modification to the algorithms. The main obstacle to their application is computational, since analytical approaches are not feasible for non-trivial problems and numerical solutions are also challenging due to the need to solve high-dimensional integration problems. Nonetheless, the most recent advances in Bayesian computation, such as Markov chain Monte Carlo techniques, ensemble methods, and likelihood-free sequential Monte Carlo methods, have been successfully applied to biological systems, usually in the case of lower-dimensional problems and/or availability of a relatively high number of data samples. Maximum-likelihood estimation has also been extensively applied.
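A minimal sketch of the Bayesian idea, using a random-walk Metropolis sampler on a single decay-rate parameter: instead of one best-fit value, the output is a sample from the posterior distribution. This is a drastically simplified stand-in for the MCMC methods cited above; the model and all numbers are illustrative, and a flat prior on k > 0 is assumed:

```python
import math, random

# Synthetic noisy observations of x(t) = exp(-k*t).
random.seed(1)
TRUE_K, SIGMA = 0.5, 0.05
times = [0.4 * i for i in range(15)]
data = [math.exp(-TRUE_K * t) + random.gauss(0, SIGMA) for t in times]

def log_likelihood(k):
    """Gaussian measurement noise => log-likelihood is -SSE/(2*sigma^2)."""
    return -sum((math.exp(-k * t) - d) ** 2
                for t, d in zip(times, data)) / (2 * SIGMA ** 2)

samples, k = [], 1.0                     # start from a deliberately wrong guess
for step in range(5000):
    k_new = abs(k + random.gauss(0, 0.05))   # reflected random-walk proposal
    # Metropolis acceptance rule (flat prior, symmetric proposal):
    if math.log(random.random()) < log_likelihood(k_new) - log_likelihood(k):
        k = k_new
    if step >= 1000:                     # discard burn-in
        samples.append(k)

posterior_mean = sum(samples) / len(samples)
print(f"posterior mean of k = {posterior_mean:.2f} (true value {TRUE_K})")
```

The retained `samples` approximate the whole posterior of k, so spread (credible intervals) comes for free, which is exactly the advantage over a point estimate.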
More recently, parameter estimation for computational biology models has been tackled in the framework of control theory by using state observers. These algorithms were originally developed for the problem of state estimation, in which one seeks to estimate the time evolution of the unobserved components of the state of a dynamical system. The controls literature on this subject is vast, but in the context of biological or biochemical systems the classically used approaches include Luenberger-like observers, Kalman filter based methods, and high-gain observers. Other methods have been developed by exploiting the special structure of specific problems. State observers can be employed for parameter estimation using the technique of state extension, in which parameters are transformed into states by suitably expanding the system under study. In this context, extended Kalman filtering and unscented Kalman filtering methods have been applied as well.
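The state-extension technique can be sketched on a toy system: to estimate the decay rate a in x' = -a*x from noisy measurements of x alone, the parameter is appended to the state vector (z = [x, a], with dynamics a' = 0) and a discrete-time extended Kalman filter is run on the augmented system. This hand-rolled 2-state filter is only an illustration of the technique, not the constrained hybrid filter introduced in this paper; all numbers (true parameter, noise levels, initial guess) are illustrative:

```python
import math, random

def mmul(A, B):        # 2x2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(A):
    return [[A[0][0], A[1][0]], [A[0][1], A[1][1]]]

random.seed(2)
DT, TRUE_A, R = 0.05, 0.8, 0.02 ** 2     # time step, true parameter, meas. variance
x, ys = 1.0, []
for _ in range(200):                      # simulate noisy measurements of x
    x *= math.exp(-TRUE_A * DT)
    ys.append(x + random.gauss(0, 0.02))

z = [1.0, 0.3]                            # initial guess: a = 0.3 (deliberately wrong)
P = [[0.01, 0.0], [0.0, 1.0]]             # initial covariance (a very uncertain)
Q = [[1e-8, 0.0], [0.0, 1e-8]]            # small process noise

for y in ys:
    # Predict: Euler step of the augmented system z = [x, a], a' = 0,
    # and its Jacobian F evaluated at the current estimate.
    xh, ah = z
    z = [xh * (1.0 - ah * DT), ah]
    F = [[1.0 - ah * DT, -xh * DT], [0.0, 1.0]]
    P = mmul(mmul(F, P), transpose(F))
    P = [[P[i][j] + Q[i][j] for j in range(2)] for i in range(2)]
    # Update with the scalar measurement y = x + noise (H = [1, 0]).
    S = P[0][0] + R
    K = [P[0][0] / S, P[1][0] / S]
    innovation = y - z[0]
    z = [z[0] + K[0] * innovation, z[1] + K[1] * innovation]
    IKH = [[1.0 - K[0], 0.0], [-K[1], 1.0]]
    P = mmul(IKH, P)

print(f"estimated a = {z[1]:.2f} (true value {TRUE_A})")
```

The cross-covariance built up by the Jacobian's off-diagonal term (-x*dt) is what lets measurements of x alone correct the parameter estimate; once x has decayed into the noise floor, the parameter becomes unobservable and its estimate freezes.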
When the number of unknown parameters is very large, it is often impossible to find a unique solution to this problem. In this case, one finds several sets of parameters, or ranges of values, that are all equally likely to give a good fit. This situation is usually referred to as the model being non-identifiable, and it is the one most commonly encountered in practice. Furthermore, it is known that a large class of systems biology models display sensitivities to the parameter values that are roughly evenly distributed over many orders of magnitude. Such "sloppiness" has been suggested as a factor that makes parameter estimation difficult. These and similar results indicate that the search for the exact individual values of the parameters is a hopeless task in most cases. However, it is also known that even if the estimation process is not able to tightly constrain any of the parameter values, the models can still be able to yield significant quantitative predictions.
The purpose of the present contribution is to extend the results on parameter estimation by Kalman filtering by introducing a procedure that can be applied to large parameter spaces, can handle sparse and noisy data, and provides an evaluation of the statistical significance of the computed estimates. To achieve this goal, we introduce a constrained hybrid extended Kalman filtering algorithm, together with a measure of the accuracy of the estimation process based on a variance test. Furthermore, we show how these techniques together can also be used to address the problem of model selection, in which one has to pick the most plausible model for a given process among a list of candidates. A distinctive feature of this approach is the ability to use information about the statistics of the measurement noise to ensure that the estimated parameters are statistically consistent with the available experimental data.
The rest of this paper is organized as follows. In the Methods Section we introduce all the theory associated with our procedure, namely the constrained hybrid extended Kalman filter, the accuracy measure and its use in estimation refinement, and the application to the model selection problem. In the Results Section we demonstrate the procedure on two examples drawn from molecular biology. Finally, in the Discussion Section we summarize the new procedure, we give some additional remarks, and we point out how these findings will be of immediate interest to researchers in computational biology, who use experimental data to construct dynamical models of biological phenomena.
Evolution of an assay
Prior to its implementation in current laboratory practice, an assay has to proceed through different phases, from development to final validation (Figure 1). The most suitable conditions for performing the assay are studied during the optimisation phase, after which a preliminary validation is performed that enables the establishment of the specific criteria for the actual validation and a thorough evaluation of the characteristics of the assay to be validated.
The evolutional steps of an assay.
Once an assay is validated for its intended purpose, a revalidation policy also needs to be established. Revalidation of an assay is a process in which either a specific part of the validation or the complete validation is repeated for particular reasons, such as a predetermined revalidation schedule, the failure of one or two parameters in a validation followed by reoptimisation of the method, or changes in the manufacturing process, product or analytical procedure (including equipment and/or software changes).
As a general rule, assays need to be validated or revalidated:
Before their introduction into routine use
Whenever the conditions change for which the assay has been validated, for example, an instrument with different characteristics
Whenever the method is changed, and the change is outside the original scope of the method
Assay development begins as soon as the product is available. The requirements for assay performance change as the product evolves from one developmental phase to another. For example, in preclinical and Phase I studies it may be acceptable to have qualitative or semiquantitative methods, like agarose gel electrophoresis, for analysing different plasmid isoforms, but when approaching Phase II and III, quantitative chromatographic methods are recommended. The same applies to the validation policies: a lighter validation package for qualitative and semiquantitative methods might be acceptable in the early phase of development, but in later product developmental phases the validation requirements increase considerably. Thus, it can be concluded that the different product developmental phases determine the extent and level of the validation package needed. While it might be enough to validate all assays used for the product release analyses and to have preliminarily validated assays for products entering Phase I clinical trials, a full validation package is recommended for all assays used in the quality control analyses of products entering Phase II/III clinical trials, as well as for the assays used in Good Clinical Practice (GCP) studies.
A particular validation challenge is conferred by biological assays, which measure the response of a living cell or organism induced by a drug or other stimulus. Bioassays are generally very laborious and the data obtained within a validation package may contain large amounts of noise. Therefore, these assays require a treatment that differs from a ligand-binding assay such as ELISA or cell surface receptor binding assay. Examples of bioassays include apoptosis, cell proliferation, cell migration, cell secretion, cell stimulation, gene expression and cell function/inhibition assays.
We introduced MLC, a metric learning method for building automatic function predictors from a large collection of expression data. MLC calculates gene co-expression by assigning GO-term-specific weights to each sample. The weights aim at maximizing the co-expression similarity between genes that are annotated with that GO term. In general, training GO-term-specific classifiers (also known as the 'Binary Relevance' approach in the machine learning literature) has the disadvantage that individual classifiers fail to see the 'bigger picture' and cannot exploit the correlations between terms imposed by the ontological structure. Several works on multi-label classification have shown that Binary Relevance performs worse than models that incorporate label correlations (Li and Zhang, 2014; Suzuki et al., 2001; Tanaka et al., 2015). Despite this, we showed that the weight profiles learned by MLC do correlate with real biological knowledge, such as semantic similarity in the ontology graph and gene annotation similarity, meaning that our method is powerful enough to capture at least some of the label similarities even though it was not exposed to them. Due to the use of L1 regularization, MLC can also select informative samples by setting the weights of non-informative samples to zero. Moreover, we showed that the selected samples come from biological conditions relevant to the GO term in question.
Our method is designed to work well with a GBA approach like the k-NN classifier. This classifier assigns a GO term to a test gene if a large enough fraction of its top co-expressed training genes are annotated with that term. To achieve this, MLC tries to maximize the difference between the average co-expression of gene pairs that are both annotated with the GO term of interest ('p–p' pairs) and the average co-expression of gene pairs only one of which is annotated with the term ('p–n' pairs). During the training phase, our model ignores gene pairs where neither gene has the term of interest ('n–n' pairs). Such pairs could include either two genes that have common GO annotations different from the GO term of interest, or two genes with completely different annotations. For the first case, one might be tempted to think that the co-expression of such pairs should be high. However, if their common function is different from the term of interest, they are likely correlated over a different set of samples than the one related to the GO term of interest and are thus uninformative for that GO term. For the second type of 'n–n' pairs, the ones that share no annotations whatsoever, it might make sense to want their co-expression to be 0, as they are expected to be dissimilar over any set of samples. However, we decided to ignore these pairs as they do not add any term-specific information, so it is not clear how they would affect the identification of samples specifically relevant for a particular term. This might be problematic, as for a negative test gene (i.e. a gene that should not be annotated with the GO term of interest) we cannot exclude that it is as highly co-expressed with positive as with negative genes, because we did not tune the co-expression values for 'n–n' pairs. For very frequent terms with many positive training genes, this leads to many false positive predictions, which might explain the poor performance of MLC for frequent terms.
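The guilt-by-association k-NN rule described above can be sketched in a few lines; the toy expression profiles, the unweighted inner-product similarity and the 50% vote threshold are illustrative stand-ins for the learned metric:

```python
def knn_predict(test_profile, train, k=3, threshold=0.5):
    """Assign the GO term to the test gene if at least `threshold` of its
    k most co-expressed training genes carry the term.
    `train` is a list of (expression_profile, has_term) pairs."""
    def coexpr(u, v):                    # plain (unweighted) inner product
        return sum(a * b for a, b in zip(u, v))
    ranked = sorted(train, key=lambda g: coexpr(test_profile, g[0]),
                    reverse=True)
    votes = sum(1 for _, has_term in ranked[:k] if has_term)
    return votes / k >= threshold

# Toy data: genes with the term share one expression pattern,
# genes without it another.
train = [([1.0, 1.0, 0.0], True),  ([0.9, 1.1, 0.0], True),
         ([1.0, 0.8, 0.1], True),  ([0.0, 0.0, 1.0], False),
         ([0.1, 0.0, 0.9], False), ([0.0, 0.1, 1.0], False)]
print(knn_predict([1.0, 0.9, 0.0], train))   # True: its neighbors carry the term
print(knn_predict([0.0, 0.1, 1.0], train))   # False
```

Since only the ranking of neighbors matters to this rule, any learned similarity (such as MLC's weighted inner product) can be substituted for `coexpr` without changing the classifier.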
The similarity function that we used as a basis for MLC is the weighted inner product (Sw). We chose this measure because its unweighted version is identical to the unweighted PCC for centered and scaled data, but it has a simpler form, which eases the computational burden. The weighted versions of the inner product and the PCC are no longer identical, as the data are no longer scaled after weighting the samples. This has the side-effect that the similarity functions that MLC learns are not necessarily in the range [-1, 1], like the PCC. In most cases, their range is much narrower, as can be seen in Figure 2b for GO:1903047. Also, because of the range differences, it is not trivial to compare the similarity of two genes across different GO terms. For the purpose of classification with the k-NN classifier, however, the range of the metric is immaterial (only the relative rankings matter for finding the proper neighborhood).
Our model is more general and not restricted to the inner product, though. The main idea is to maximize the difference between the similarity of 'p–p' and 'p–n' pairs. This is done by maximizing the t-statistic between the two distributions of similarities. This means that MLC can also be applied to any measure of similarity, such as the weighted PCC, weighted Spearman correlation, Euclidean distance, etc. Regardless of the chosen metric, the two classes ('p–p' and 'p–n') do not meet the assumptions for applying Student's t-test, as the similarity values are neither normally distributed nor independent. This is not an issue, though, because we do not use the t-statistic to compute a P-value (which would exploit the fact that the t-statistic follows Student's t-distribution under these assumptions), but only to quantify the class separability (Theodoridis and Koutroumbas, 2008). Equivalently, we could have used any other measure of class separability, for instance the Fisher Discriminant Ratio (Fisher, 1936) or the Davies-Bouldin index (Davies and Bouldin, 1979).
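As a sketch of this objective, the (Welch-style) t-statistic between the two similarity distributions can be computed directly and read purely as a separability score; the similarity values below are illustrative:

```python
import statistics

def separability(pp_sims, pn_sims):
    """Welch-style t-statistic between 'p-p' and 'p-n' similarity values,
    used only as a class-separability score, not for a P-value."""
    m1, m2 = statistics.fmean(pp_sims), statistics.fmean(pn_sims)
    v1, v2 = statistics.variance(pp_sims), statistics.variance(pn_sims)
    return (m1 - m2) / ((v1 / len(pp_sims) + v2 / len(pn_sims)) ** 0.5)

# Illustrative values: p-p pairs are more co-expressed than p-n pairs.
pp = [0.8, 0.7, 0.9, 0.75, 0.85]
pn = [0.3, 0.4, 0.2, 0.35, 0.5]
print(f"t = {separability(pp, pn):.2f}")
```

A larger value means the two similarity distributions are better separated, which is exactly what the weight optimization tries to increase; any other separability score could be dropped in its place.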
4.2 Comparison to related methods
Our work validates the observation that the PCC is not the optimal co-expression measure for AFP. The MR attempts to obtain more robust, less noisy co-expression values by converting the PCC values into ranks and averaging the reciprocal rankings of two genes (Obayashi et al., 2018). MLC takes a fundamentally different approach, operating at the sample level rather than the correlation level. First and foremost, as mentioned above, it removes samples that do not help discriminate between genes that do or do not perform a certain function. In doing so, MLC gives insight into which samples are important for a given GO term, which can subsequently be used to investigate the expression patterns of the GO term's related genes across these samples. Weighting samples differently can also be viewed as a form of denoising. For example, it can compensate for the fact that an expression change of one unit has a different meaning in different samples due to technical variation, such as differences in sequencing depth or sample preparation. Our results have shown that MLC is more beneficial than the MR approach for the more specific, and arguably more useful, GO terms.
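As an illustration (not the authors' code), MR is commonly defined as the geometric mean of the two reciprocal co-expression ranks; a sketch with NumPy:

```python
import numpy as np

def mutual_rank(pcc):
    """Mutual Rank matrix from a gene-by-gene PCC matrix.

    MR(a, b) is the geometric mean of b's rank among a's correlations
    and a's rank among b's (1 = strongest partner); a lower MR means a
    stronger link. Self-correlations are kept here for simplicity,
    whereas real implementations typically exclude them.
    """
    pcc = np.asarray(pcc, dtype=float)
    n = pcc.shape[0]
    ranks = np.empty_like(pcc)
    for i in range(n):
        order = np.argsort(-pcc[i])        # indices sorted by descending PCC
        r = np.empty(n)
        r[order] = np.arange(1, n + 1)     # 1-based rank of each gene
        ranks[i] = r
    return np.sqrt(ranks * ranks.T)
```

Note that MR stays at the correlation level: it reweights gene pairs, whereas MLC reweights the samples themselves.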
A method similar to MLC is GAAWGEFA, which learns a weight for each sample in a dataset and then applies a weighted Pearson correlation. There are two fundamental differences between the two methods. Firstly, GAAWGEFA aims at good protein-centric performance, i.e. it tries to do well on average over all genes and therefore learns only one set of sample weights, whereas MLC aims at maximizing the performance for each GO term individually. Secondly, GAAWGEFA learns the weights using a genetic algorithm. For MLC, we used the inner product, which gave us a simple optimization problem that can be solved very efficiently. Even though MLC has to be run for each term separately, it is still 67% faster than GAAWGEFA and, unlike GAAWGEFA, runs for different GO terms can be carried out in parallel for an even greater speed-up. Beyond these differences, MLC makes more accurate predictions for rarer terms and provides interpretability of the predictions through the term-specific sample weight distributions.
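For reference, the weighted Pearson correlation that GAAWGEFA applies on top of its learned weights can be sketched as follows (a standard formulation; parameter names are ours):

```python
import numpy as np

def weighted_pcc(x, y, w):
    """Pearson correlation of x and y under non-negative sample weights w."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.asarray(w, float) / np.sum(w)      # normalize weights to sum to 1
    mx, my = np.sum(w * x), np.sum(w * y)     # weighted means
    cov = np.sum(w * (x - mx) * (y - my))     # weighted covariance
    vx = np.sum(w * (x - mx) ** 2)            # weighted variances
    vy = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(vx * vy)
```

With uniform weights this reduces to the ordinary PCC; unlike the weighted inner product used by MLC, its values always stay in [–1, 1].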
Furthermore, in the context of selecting expression samples, a related technique is biclustering. Biclustering is an umbrella term for a diverse set of algorithms that simultaneously select subsets of genes and samples, such that the genes in the same subset (bicluster) have similar expression to each other within the samples of that bicluster. Each bicluster is typically expected to reflect a biological process, which makes the rationale of MLC appear similar to that of a biclustering approach. Although both approaches make use of sample selection and aim at discovering genes involved in the same biological process, they are fundamentally different in the sense that MLC is supervised while biclustering is unsupervised. Biclustering does not make use of GO annotations, but only of the expression matrix. In fact, observing enrichment of certain GO terms or KEGG pathways among the genes of a bicluster is a common way to validate a biclustering result (Santamaría et al., 2007). MLC, on the other hand, starts with a set of genes whose GO annotations are (at least partly) known and uses the expression matrix to identify which of the remaining genes participate in a particular biological process, by defining a co-expression measure specific to that process.
4.3 Possible extensions
MLC learns the sample weights automatically from the available data and does not rely on information about the samples' biological condition or tissue. As curation efforts increase and the amount of well-annotated data in public databases grows with time, it might become useful to extend MLC to incorporate such knowledge. A possible way to do so would be a group LASSO approach (Yuan and Lin, 2006). Group LASSO uses predefined groups of samples and forces the weights of all samples in a group to be equal. Each such group could contain technical and biological replicates, samples from the same tissue, or samples from similar knockout experiments and perturbations.
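A sketch of such a group-structured extension (function and group labels are illustrative, not from the paper): the weight-tying step expands one learned weight per group into per-sample weights, and the standard group LASSO penalty could then be added to the objective.

```python
import numpy as np

def expand_group_weights(group_weights, groups):
    """Expand one learned weight per sample group into per-sample weights.

    groups:        one label per sample (e.g. tissue or replicate batch);
    group_weights: one weight per unique label, in sorted label order.
    All samples in a group share the same weight, as group LASSO enforces.
    """
    groups = np.asarray(groups)
    w = np.empty(groups.size, dtype=float)
    for label, wg in zip(np.unique(groups), group_weights):
        w[groups == label] = wg
    return w

def group_lasso_penalty(w, groups, lam=1.0):
    """Standard group LASSO penalty: lam * sum_g sqrt(|g|) * ||w_g||_2."""
    groups = np.asarray(groups)
    w = np.asarray(w, dtype=float)
    return lam * sum(np.sqrt((groups == g).sum()) * np.linalg.norm(w[groups == g])
                     for g in np.unique(groups))
```

The penalty drives entire groups of samples to zero weight together, which is what would let MLC discard, say, a whole irrelevant tissue at once.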
A disadvantage of MLC is that it does not account for the possibility that genes showing exactly opposite expression patterns (i.e. genes with large negative correlation) might also be involved in the same biological process. In fact, negative correlations are penalized, as our model explicitly tries to force the signed similarities of p–p pairs to be larger than those of the p–n pairs. In Figure 2b, we see that large negative PCC values are scarce in our dataset, implying that we might not lose much by ignoring negative similarities, at least when the PCC is used as the similarity measure. The effect might be larger for our MLC approach, though, which selects a subset of the samples; within this smaller set, negative correlations might be more frequent.
To handle this shortcoming, one could directly use the absolute value of the weighted co-expression in the model. Doing so introduces an additional challenge, namely that the absolute value is not differentiable at 0. This can be overcome by approximating the absolute value with a smooth function, such as √(x² + ϵ), where ϵ is a small positive number (Ramírez et al., 2013), but this makes the co-expression function non-linear and the calculation of its derivative with respect to w more costly. More importantly, it makes the optimization problem more difficult, as it adds an extra non-linearity to an already non-convex objective function, meaning that it might be harder to find a good solution for the weights.
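The smooth surrogate √(x² + ϵ) and its derivative, sketched in code (ϵ trades off smoothness against accuracy near 0):

```python
import numpy as np

def smooth_abs(x, eps=1e-6):
    """Smooth surrogate for |x|: sqrt(x^2 + eps), differentiable everywhere.

    At x = 0 it returns sqrt(eps) instead of 0, so smaller eps means a
    closer approximation but sharper curvature around the origin.
    """
    return np.sqrt(x ** 2 + eps)

def smooth_abs_grad(x, eps=1e-6):
    """Derivative x / sqrt(x^2 + eps): exactly 0 at x = 0, and close to
    sign(x) once |x| is much larger than sqrt(eps)."""
    return x / np.sqrt(x ** 2 + eps)
```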
One could also think of alternative formulations of the objective function that would accommodate absolute co-expression values more easily. For instance, it would be possible to minimize the squared difference between the weighted absolute correlations and a target value (e.g. 0 for p–n pairs and 1 for p–p pairs). Another possibility would be to use a triplet loss, which has been successfully used in image retrieval (Husain et al., 2019). In the triplet loss, we look at sets of three genes at a time instead of two: two positive genes (p1, p2) and one negative (n1). Then, we maximize the difference Sw(p1, p2) − Sw(p1, n1), where Sw is in this case the absolute weighted similarity.
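A hinge-style sketch of such a triplet objective on absolute similarities (the margin formulation is a common metric-learning choice, not specified in the text):

```python
def triplet_loss(sim_p1p2, sim_p1n1, margin=0.0):
    """Loss for one triplet: p1, p2 positive genes, n1 negative.

    Minimizing this pushes |S_w(p1, p2)| above |S_w(p1, n1)| by at least
    `margin`, i.e. it maximizes the gap described in the text while
    ignoring the sign of the co-expression.
    """
    return max(0.0, margin - (abs(sim_p1p2) - abs(sim_p1n1)))
```

Because the absolute values enter directly, anti-correlated positive pairs are rewarded rather than penalized.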
Finally, in this work, we applied MLC to finding candidate genes for GO terms from the BPO. However, it can be useful for any gene annotation problem that can be addressed with expression data, such as finding members of KEGG pathways or genes likely to influence a given phenotypic trait. As MLC is computationally efficient, it can easily be applied to a large number of different terms or phenotypes, offering state-of-the-art performance with the added benefit of allowing users to understand which parts of the dataset influence the predictions.
Gene function discovery: New computational model predicts gene function
Scientists have created a new computational model that can be used to predict the function of uncharacterized plant genes with unprecedented speed and accuracy. The network, dubbed AraNet, links over 19,600 genes with more than 1 million associations and can increase the discovery rate of new genes affiliated with a given trait tenfold. It is a huge boost to fundamental plant biology and agricultural research.
Despite immense progress in functional characterization of plant genomes, over 30% of the 30,000 Arabidopsis genes have not been functionally characterized yet. Another third has little evidence regarding their role in the plant.
"In essence, AraNet is based on the simple idea that genes that physically reside in the same neighborhood, or turn on in concert with one another are probably associated with similar traits," explained corresponding author Sue Rhee at the Carnegie Institution's Department of Plant Biology. "We call it guilt by association. Based on over 50 million scientific observations, AraNet contains over 1 million linkages of the 19,600 genes in the tiny, experimental mustard plant Arabidopsis thaliana. We made a map of the associations and demonstrated that we can use the network to propose that uncharacterized genes are linked to specific traits based on the strength of their associations with genes already known to be linked to those characteristics."
The network allows for two main types of testable hypotheses. The first uses a set of genes known to be involved in a biological process such as stress responses, as a "bait" to find new genes ("prey") involved in stress responses. The bait genes are linked to each other based on over 24 different types of experiments or computations. If they are linked to each other much more frequently or strongly than by chance, one can hypothesize that other genes that are as well linked to the bait genes have a high probability of being involved in the same process. The second testable hypothesis is to predict functions for uncharacterized genes. There are 4,479 uncharacterized genes in AraNet that have links to ones that have been characterized, so a significant portion of all the unknowns now get a new hint as to their function.
The scientists tested the accuracy of AraNet with computational validation tests and laboratory experiments on genes that the network predicted as related. The researchers selected three uncharacterized genes. Two of them exhibited phenotypes that AraNet predicted. One is a gene that regulates drought sensitivity, now named Drought sensitive 1 (Drs1). The other regulates lateral root development, called Lateral root stimulator 1 (Lrs1). The researchers found that the network is much stronger forecasting correct associations than previous small-scale networks of Arabidopsis genes.
"Plants, animals and other organisms share a surprising number of the same or similar genes -- particularly those that arose early in evolution and were retained as organisms differentiated over time," commented a lead and corresponding author Insuk Lee at Yonsei University of South Korea. "AraNet not only contains information from plant genes, it also incorporates data from other organisms. We wanted to know how much of the system's accuracy was a result of plant data versus non-plant derived data. We found that although the plant linkages provided most of the predictive power, the non-plant linkages were a significant contributor."
"AraNet has the potential to help realize the promise of genomics in plant engineering and personalized medicine," remarked Rhee. "A main bottleneck has been the huge portion of genes with unknown function, even in model organisms that have been studied intensively. We need innovative ways of discovering gene function and AraNet is a perfect example of such innovation.
"Food security is no longer taken for granted in the fast-paced milieu of the changing climate and globalized economy of the 21st century. Innovations in the basic understanding of plants and effective application of that knowledge in the field are essential to meet this challenge. Numerous genome-scale projects are underway for several plant species. However, new strategies to identify candidate genes for specific plant traits systematically by leveraging these high-throughput, genome-scale experimental data are lagging. AraNet integrates all such data and provides a rational, statistical assessment of the likelihood of genes functioning in particular traits, thereby assisting scientists to design experiments to discover gene function. AraNet will become an essential component of the next-generation plant research."
The research is published in the January 31st advance online edition of Nature Biotechnology and was supported by the Carnegie Institution for Science, the National Research Foundation of Korea, Yonsei University, the National Science Foundation, the National Institutes of Health, and the Packard Foundation.
Materials provided by Carnegie Institution.
Research and Publication Ethics
Research Involving Human Subjects
When reporting on research that involves human subjects, human material, human tissues, or human data, authors must declare that the investigations were carried out following the rules of the Declaration of Helsinki of 1975 (https://www.wma.net/what-we-do/medical-ethics/declaration-of-helsinki/), revised in 2013. According to point 23 of this declaration, an approval from an ethics committee should have been obtained before undertaking the research. At a minimum, a statement including the project identification code, date of approval, and name of the ethics committee or institutional review board should be provided in Section ‘Institutional Review Board Statement’ of the article. Data relating to individual participants must be described in detail, but private information identifying participants should not be included unless the identifiable materials are of relevance to the research (for example, photographs of participants’ faces that show a particular symptom). Editors reserve the right to reject any submission that does not meet these requirements.
Example of an ethical statement: "All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of XXX (Project identification code)."
A written informed consent for publication must be obtained from participating patients who can be identified (including by the patients themselves). Patients’ initials or other personal identifiers must not appear in any images. For manuscripts that include any case details, personal information, and/or images of patients, authors must obtain signed informed consent from patients (or their relatives/guardians) before submitting to an MDPI journal. Patient details must be anonymized as far as possible, e.g., do not mention specific age, ethnicity, or occupation where they are not relevant to the conclusions. A template permission form is available to download. A blank version of the form used to obtain permission (without the patient names or signature) must be uploaded with your submission.
You may refer to our sample form and provide an appropriate form after consulting with your affiliated institution. Alternatively, you may provide a detailed justification of why informed consent is not necessary. For the purposes of publishing in MDPI journals, a consent, permission, or release form should include unlimited permission for publication in all formats (including print, electronic, and online), in sublicensed and reprinted versions (including translations and derived works), and in other works and products under open access license. To respect patients’ and any other individual’s privacy, please do not send signed forms. The journal reserves the right to ask authors to provide signed forms if necessary.
Research Involving Animals
The editors will require that the benefits potentially derived from any research causing harm to animals are significant in relation to any cost endured by animals, and that the procedures followed are unlikely to cause offense to the majority of readers. Authors should particularly ensure that their research complies with the commonly accepted '3Rs':
- Replacement of animals by alternatives wherever possible,
- Reduction in number of animals used, and
- Refinement of experimental conditions and procedures to minimize the harm to animals.
Any experimental work must also have been conducted in accordance with relevant national legislation on the use of animals for research. For further guidance, authors should refer to the Code of Practice for the Housing and Care of Animals Used in Scientific Procedures.
Manuscripts containing original descriptions of research conducted in experimental animals must contain details of approval by a properly constituted research ethics committee. As a minimum, the project identification code, date of approval and name of the ethics committee or institutional review board should be stated in Section ‘Institutional Review Board Statement’.
Biology endorses the ARRIVE guidelines (www.nc3rs.org.uk/ARRIVE) for reporting experiments using live animals. Authors and reviewers can use the ARRIVE guidelines as a checklist, which can be found at https://arriveguidelines.org/resources/questionnaire.
1. Home Office. Animals (Scientific Procedures) Act 1986. Code of Practice for the Housing and Care of Animals Used in Scientific Procedures. Available online: http://www.official-documents.gov.uk/document/hc8889/hc01/0107/0107.pdf.
Research Involving Cell Lines
Methods sections for submissions reporting on research with cell lines should state the origin of any cell lines. For established cell lines the provenance should be stated and references must also be given to either a published paper or to a commercial source. If previously unpublished de novo cell lines were used, including those gifted from another laboratory, details of institutional review board or ethics committee approval must be given, and confirmation of written informed consent must be provided if the line is of human origin.
An example of Ethical Statements:
The HCT116 cell line was obtained from XXXX. The MLH1 + cell line was provided by XXXXX, Ltd. The DLD-1 cell line was obtained from Dr. XXXX. The DR-GFP and SA-GFP reporter plasmids were obtained from Dr. XXX and the Rad51K133A expression vector was obtained from Dr. XXXX.
Research Involving Plants
Experimental research on plants (either cultivated or wild) including collection of plant material, must comply with institutional, national, or international guidelines. We recommend that authors comply with the Convention on Biological Diversity and the Convention on the Trade in Endangered Species of Wild Fauna and Flora.
For each submitted manuscript supporting genetic information and origin must be provided. For research manuscripts involving rare and non-model plants (other than, e.g., Arabidopsis thaliana, Nicotiana benthamiana, Oryza sativa, or many other typical model plants), voucher specimens must be deposited in an accessible herbarium or museum. Vouchers may be requested for review by future investigators to verify the identity of the material used in the study (especially if taxonomic rearrangements occur in the future). They should include details of the populations sampled on the site of collection (GPS coordinates), date of collection, and document the part(s) used in the study where appropriate. For rare, threatened or endangered species this can be waived but it is necessary for the author to describe this in the cover letter.
Editors reserve the rights to reject any submission that does not meet these requirements.
An example of Ethical Statements:
Torenia fournieri plants were used in this study. White-flowered Crown White (CrW) and violet-flowered Crown Violet (CrV) cultivars selected from ‘Crown Mix’ (XXX Company, City, Country) were kindly provided by Dr. XXX (XXX Institute, City, Country).
Arabidopsis mutant lines (SALKxxxx, SAILxxxx, …) were kindly provided by Dr. XXX (XXX Institute, City, Country).
Clinical Trials Registration
MDPI follows the International Committee of Medical Journal Editors (ICMJE) guidelines which require and recommend registration of clinical trials in a public trials registry at or before the time of first patient enrollment as a condition of consideration for publication.
Purely observational studies do not require registration. A clinical trial refers not only to studies that take place in a hospital or involve pharmaceuticals, but also to all studies which involve participant randomization and group classification in the context of the intervention under assessment.
Authors are strongly encouraged to pre-register clinical trials with an international clinical trials register and cite a reference to the registration in the abstract and Methods section. Suitable databases include clinicaltrials.gov, the EU Clinical Trials Register and those listed by the World Health Organisation International Clinical Trials Registry Platform.
Approval to conduct a study from an independent local, regional, or national review body is not equivalent to prospective clinical trial registration. MDPI reserves the right to decline any paper without trial registration for further peer-review. However, if the study protocol has been published before the enrolment, the registration can be waived with correct citation of the published protocol.
MDPI requires a completed CONSORT 2010 checklist and flow diagram as a condition of submission when reporting the results of a randomized trial. Templates for these can be found here or on the CONSORT website (http://www.consort-statement.org) which also describes several CONSORT checklist extensions for different designs and types of data beyond two group parallel trials. At minimum, your article should report the content addressed by each item of the checklist.
Sex and Gender in Research
We encourage our authors to follow the ‘Sex and Gender Equity in Research – SAGER – guidelines’ and to include sex and gender considerations where relevant. Authors should use the terms sex (biological attribute) and gender (shaped by social and cultural circumstances) carefully in order to avoid confusing the two. Article titles and/or abstracts should indicate clearly what sex(es) the study applies to. Authors should also: describe in the background whether sex and/or gender differences may be expected; report how sex and/or gender were accounted for in the design of the study; provide data disaggregated by sex and/or gender, where appropriate; and discuss the respective results. If a sex and/or gender analysis was not conducted, the rationale should be given in the Discussion. We suggest that our authors consult the full guidelines before submission.
Borders and Territories
Potential disputes over borders and territories may have particular relevance for authors in describing their research or in an author or editor correspondence address, and should be respected. Content decisions are an editorial matter and where there is a potential or perceived dispute or complaint, the editorial team will attempt to find a resolution that satisfies parties involved.
MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Publication Ethics Statement
Biology is a member of the Committee on Publication Ethics (COPE). We fully adhere to its Code of Conduct and to its Best Practice Guidelines.
The editors of this journal enforce a rigorous peer-review process together with strict ethical policies and standards to ensure that high-quality scientific works are added to the field of scholarly publication. Unfortunately, cases of plagiarism, data falsification, image manipulation, inappropriate authorship credit, and the like, do arise. The editors of Biology take such publishing ethics issues very seriously and are trained to proceed in such cases with a zero tolerance policy.
Authors wishing to publish their papers in Biology must abide by the following:
- Any facts that might be perceived as a possible conflict of interest of the author(s) must be disclosed in the paper prior to submission.
- Authors should accurately present their research findings and include an objective discussion of the significance of their findings.
- Data and methods used in the research need to be presented in sufficient detail in the paper, so that other researchers can replicate the work.
- Raw data should preferably be publicly deposited by the authors before submission of their manuscript. Authors need to at least have the raw data readily available for presentation to the referees and the editors of the journal, if requested. Authors need to ensure appropriate measures are taken so that raw data is retained in full for a reasonable time after publication.
- Simultaneous submission of manuscripts to more than one journal is not tolerated.
- Republishing content that is not novel is not tolerated (for example, an English translation of a paper that is already published in another language will not be accepted).
- If errors and inaccuracies are found by the authors after publication of their paper, they need to be promptly communicated to the editors of this journal so that appropriate actions can be taken. Please refer to our policy regarding Updating Published Papers.
- Your manuscript should not contain any information that has already been published. If you include already published figures or images, please obtain the necessary permission from the copyright holder to publish under the CC-BY license. For further information, see the Rights and Permissions page.
- Plagiarism, data fabrication and image manipulation are not tolerated.
- Plagiarism is not acceptable in Biology submissions.
Plagiarism includes copying text, ideas, images, or data from another source, even from your own publications, without giving any credit to the original source.
Reuse of text that is copied from another source must be between quotes and the original source must be cited. If a study's design or the manuscript's structure or language has been inspired by previous works, these works must be explicitly cited.
If plagiarism is detected during the peer review process, the manuscript may be rejected. If plagiarism is detected after publication, we may publish a correction or retract the paper.
Irregular manipulation includes: 1) introduction, enhancement, moving, or removal of features from the original image; 2) grouping of images that should obviously be presented separately (e.g., from different parts of the same gel, or from different gels); or 3) modifying the contrast, brightness, or color balance so as to obscure, eliminate, or enhance some information.
If irregular image manipulation is identified and confirmed during the peer review process, we may reject the manuscript. If irregular image manipulation is identified and confirmed after publication, we may correct or retract the paper.
Our in-house editors will investigate any allegations of publication misconduct and may contact the authors' institutions or funders if necessary. If evidence of misconduct is found, appropriate action will be taken to correct or retract the publication. Authors are expected to comply with the best ethical publication practices when publishing with MDPI.
Authors should ensure that where material is taken from other sources (including their own published writing) the source is clearly cited and that where appropriate permission is obtained.
Authors should not engage in excessive self-citation of their own work.
Authors should not copy references from other publications if they have not read the cited work.
Authors should not preferentially cite their own or their friends’, peers’, or institution’s publications.
Authors should not cite advertisements or advertorial material.
In accordance with COPE guidelines, we expect that “original wording taken directly from publications by other researchers should appear in quotation marks with the appropriate citations.” This condition also applies to an author’s own work. COPE have produced a discussion document on citation manipulation with recommendations for best practice.
Biological computation and computational biology: survey, challenges, and discussion
Biological computation involves the design and development of computational techniques inspired by natural biota. On the other hand, computational biology involves the development and application of computational techniques to study biological systems. We present a comprehensive review showcasing how biology and computer science can guide and benefit each other, resulting in improved understanding of biological processes and at the same time advances in the design of algorithms. Unfortunately, integration between biology and computer science is often challenging, especially due to the cultural idiosyncrasies of these two communities. In this study, we aim at highlighting how nature has inspired the development of various algorithms and techniques in computer science, and how computational techniques and mathematical modeling have helped to better understand various fields in biology. We identified existing gaps between biological computation and computational biology and advocate for bridging this gap between “wet” and “dry” research.