I have a quick question: what does microarray experiment validation mean?
I was reading a paper in which the authors say that the experimental data show that the 3 genes radB, dp1 and dp2 are co-regulated following gamma irradiation, "validating our microarray experiment".
I tried to find something on Google, but it is still not clear to me.
The main advantage of an oligonucleotide microarray is in its ability to detect multiple and novel pathogens within a single test. Technical improvements in non-specific amplification and removal of host RNA improve sensitivity, a major limiting factor when direct comparison is made with technologies such as virus-specific reverse transcription-polymerase chain reaction (RT-PCR). Furthermore, improved probe design software has also contributed to a higher sensitivity, and has simplified the design of oligonucleotides in response to the emergence of new pathogens. Addition of new probes to a high-density microarray is now rapid and relatively inexpensive, allowing timely response to changes in disease epidemiology.
For lyssaviruses, current gold standard tests do not discriminate between different species, even when virus is recoverable from clinical samples. Automated sequencing of PCR amplicons, or next generation sequencing, offer potential alternatives to speciation, but these can be costly, and demand high levels of expertise and considerable capital investment. Oligonucleotide microarrays offer an alternative that demands relatively lower expertise and is affordable by smaller laboratories. This is particularly important where more than one lyssavirus is known to be present and a patient has been diagnosed with rabies. In recent years, human deaths due to infection with European bat lyssavirus type 2 (EBLV-2), Duvenhage virus (DUVV), and Irkut virus have all required intensive investigation to confirm the causative agent to provide reassurance to public health authorities that RABV was not responsible.
The main disadvantage of oligonucleotide microarray, particularly in its dissemination to parts of the world where other lyssaviruses and RABV cause disease, is clearly cost. Initial investment is required to purchase specific equipment for nucleic acid amplification, hybridization, and scanners to measure signal intensity. Consumable and reagents costs, including the cost of microarray slides, are also a barrier to introduction, in addition to the cost of training and maintenance of skills for staff to perform the assay. Microarray detection of viruses is considerably slower when compared to other technologies, such as real-time RT-PCR, and it is this technology that is coming to dominate innovations in rabies diagnosis with gradual spread into the developing world.
In conclusion, oligonucleotide microarrays have been applied to the detection of RABV and other members of the genus Lyssavirus [6,23], and have the ability to identify the lyssavirus species in a particular sample. However, the barriers to their widespread application, other than within a pan-viral or pan-pathogen platform, are considerable, and it is unlikely that oligonucleotide microarrays will find widespread application in the detection of RABV as a diagnostic test.
Drosophila Models of Aging
Satomi Miwa, Alan Cohen, in Handbook of Models for Human Aging, 2006
Microarray analysis is a method that makes use of gene chips to which thousands of different mRNAs can bind and be quantified. By using such chips to quantify mRNA levels in different tissues or in individuals under different treatments, tens or hundreds of specific genes which vary in relation to the tissue or treatment can be identified, aiding in a mechanistic understanding of the differences. Further work can then be conducted using candidate locus approaches (see below). In contrast to QTL, which looks at allelic variation, microarray analysis looks at gene regulation: potentially, but not necessarily, a result of allelic variation.
As with QTLs, care must be exercised not only to control for genetic background and environment, but also to limit interpretation of results to the genetic background and environment studied. Also, it must be remembered that microarray analysis is by its nature correlative, and that further study of patterns is generally necessary for clear interpretation. That said, microarray analysis remains one of the most powerful techniques for examining genetic processes underlying physiological variation.
Drosophila in particular are well suited to microarray analysis (a) because the genome is relatively small, meaning that most of it can be analyzed with a single microarray and that potentially important patterns are less likely to be missed, and (b) because the functions of many genes have already been studied, facilitating interpretation of results. Microarray analysis has been used in Drosophila to characterize gene expression changes during dietary restriction and with aging (Pletcher et al., 2002).
Microarray experiment validation meaning - Biology
Here, we evaluated the potential of RNA-seq to predict clinical endpoints in comparison to microarrays. We generated gene expression profiles from 498 primary neuroblastoma samples using RNA-seq and microarrays, which represents, to the best of our knowledge, the most comprehensive description of a single cancer entity’s transcriptome. We demonstrate that gene expression profiles of neuroblastoma are tremendously complex, corresponding to findings on the transcriptomic landscape of other human cells published recently [9, 12, 30]. In the entire neuroblastoma cohort, we found 48,415 genes and 204,352 transcripts to be expressed, comprising 86.7 % and 77.3 % of all features annotated in the AceView database, respectively. We also identified >39,000 novel exons to be expressed in neuroblastoma, providing further evidence that the human transcriptome still exceeds the complexity reflected by current reference databases such as RefSeq, Gencode, and AceView. The comparison of gene expression profiles of four major clinico-genetic subgroups revealed that RNA-seq identified almost twice as many DEGs as microarrays. Of note, DEGs determined by RNA-seq comprised 80.1 % of the DEGs detected by microarrays, pointing towards the reliability of identifying DEGs by either method. One reason for the discrepant numbers received by RNA-seq and microarrays derives from the fact that 6,939 DEGs identified by RNA-seq were not represented by a probe on the microarray. In addition, 4,776 DEGs were not detected by microarrays although the genes were represented by a probe, which may be at least partly attributed to our analytical approach which was taking expression profiles at the transcript level into account. Taken together, our study substantiates that RNA-seq is capable of providing much more detailed insights into the transcriptomic characteristics of neuroblastoma than microarrays.
To systematically compare the potential of RNA-seq- and microarray-based models for clinical endpoint prediction, we utilized various data annotation pipelines and considered different feature levels to establish nine expression profiles per sample derived from RNA-seq data, complemented by one expression profile derived from microarray analyses. We generated 360 predictive models for six endpoints covering a broad range of prediction difficulties. Evaluation of the prediction performances in the validation set revealed that the endpoint represents the most relevant factor affecting model performances, which is well in line with the findings of the MAQC-II study. By contrast, neither the technical platform (that is, RNA-seq vs. microarrays) nor the RNA-seq data annotation pipeline significantly affected the variability of prediction performances. Collectively, our data demonstrate that RNA-seq and microarray-based models perform similarly in clinical endpoint prediction.
We also noticed that models based on different feature levels predicted clinical endpoints with comparable accuracies. In turn, this result implies that models based on exon-junction levels perform equally well as models based on gene levels. These findings may impact the development of expression-based classifiers to be used in clinical settings, which are frequently transferred from high-throughput analyses to RT-qPCR-based assays [6, 20]: While assays based on gene expression levels may lack specificity due to uncertainties on the underlying relevant transcript variants, exon-junctions identified by RNA-seq provide an unambiguous source of expression information for developing specific diagnostic tests.
Our results do not support the hypothesis that the more extensive transcriptomic information provided by RNA-seq in comparison to microarrays may improve gene expression-based prediction performances in general. A possible explanation for this finding might be that the inherent complexity of RNA-seq data may promote over-fitting effects in the model development process, leading to over-optimistic internal prediction performances that cannot be reproduced in external validation cohorts. We noted, however, that the correlation of internal and external validation performances was almost identical for RNA-seq and microarray-based models, indicating that over-fitting effects are independent of the technological platform. An alternative explanation for our results may be inferred from the observation that the proportion of RefSeq-annotated features in the prediction models was in the range of, or even above, their proportion in the AceView database for most endpoints. This finding may suggest that the predictive information of RefSeq-annotated genes represented by standard microarrays is saturated, and that predictive information of more complex transcriptomic data provided by RNA-seq is largely redundant. It has to be noted, though, that models for endpoints that were difficult to predict (that is, EFS HR, OS HR) tended to disproportionately recruit features that are not annotated in RefSeq, suggesting that these features may considerably contribute to the prediction accuracy in these endpoints.
Both gene expression-based models derived from RNA-seq and microarray analyses were capable of predicting patient outcome in the entire neuroblastoma cohort accurately, thereby validating results from previous studies and underscoring their potential clinical utility for risk estimation in neuroblastoma [16–18, 20]. Notably, we observed that models containing 100 to 1,000 features on average performed better than models containing fewer features. This finding may argue against ambitious efforts to minimize feature numbers in predictive models, as has been done in the past [20, 32]. In addition, we found that the best performing models were able to predict outcome of high-risk patients with a similar precision as previously published multigene signatures [18, 20, 33], and independently from current prognostic markers. While the prognostic value of such multigene signatures needs to be validated in independent high-risk neuroblastoma cohorts, these findings may represent a starting point to establish biomarker-based risk assessment in this challenging patient subgroup.
For the empirical evaluation using the E-MEXP-1091 and GSE12930 datasets, the lowess approach was used to normalize the data. Per-gene normalization was then performed by centering the expression data on the median. Analysis was carried out on all genes regardless of flags.
GSEA and Global
All analysis was conducted in R. The Bioconductor library and the GSEA 1.0 R package were used. For the Global methodology, the Global test function in the Global test library was used to identify significant pathways.
Assume that there are M genes belonging to a pathway. From each gene expression value, subtract the median expression value obtained from the combined treatment and control groups. This step aligns the data, making subsequent analyses sensitive to changes in the mean. Next, for the jth subject in group i, let ωij represent the vector of ranks of the aligned intensity values of the M genes in the pathway. Set
The use of ranks serves two purposes. First, it captures for each subject, the correlation pattern of the aligned expression values. Second, it allows for a subsequent nonparametric analysis.
Motivated by the methods of Feigin and Alvo, we propose the test statistic
where prime indicates the transpose of the vector. Under the hypothesis that there is no change between the two groups, the statistic S should be small in magnitude. Let Sobs be the value of the observed statistic.
Next, we propose a permutation test based on S. Under the null hypothesis that no change has occurred, the subjects in the two groups are interchangeable. Hence, we compute for each selection of n1 subjects from n a value of the statistic S. The nominal p-value is then given as
When the total number of possible permutations is large, we randomly choose 1000 permutations among them.
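The alignment, ranking, and permutation steps described above can be sketched in Python. This is only an illustration: the formula for S is not reproduced in this text, so `stand_in_statistic` (the squared distance between the two groups' mean rank vectors) is a hypothetical placeholder rather than the authors' statistic. The p-value is taken here as the fraction of permuted values at least as large as the observed one, which is a common convention and also an assumption. The function names, the row-per-subject input layout, and the tie-breaking rule are likewise illustrative.

```python
import random
from statistics import median


def align(group1, group2):
    """Subtract each gene's median over the combined groups.

    Each group is a list of subjects; each subject is a list of M values.
    """
    combined = group1 + group2
    n_genes = len(combined[0])
    medians = [median(row[g] for row in combined) for g in range(n_genes)]

    def shift(rows):
        return [[v - m for v, m in zip(row, medians)] for row in rows]

    return shift(group1), shift(group2)


def ranks(values):
    """Rank the M aligned values of one subject from 1 to M.

    Ties are broken by position, which is an assumption of this sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def stand_in_statistic(ranks1, ranks2):
    """Hypothetical placeholder for S: squared distance between mean rank vectors."""
    def mean_vector(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    return sum((a - b) ** 2
               for a, b in zip(mean_vector(ranks1), mean_vector(ranks2)))


def permutation_p_value(group1, group2, n_perm=1000, seed=0):
    """Estimate a p-value from n_perm random relabellings of the subjects."""
    aligned1, aligned2 = align(group1, group2)
    all_ranks = [ranks(subject) for subject in aligned1 + aligned2]
    n1 = len(group1)
    s_obs = stand_in_statistic(all_ranks[:n1], all_ranks[n1:])

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        shuffled = all_ranks[:]
        rng.shuffle(shuffled)
        if stand_in_statistic(shuffled[:n1], shuffled[n1:]) >= s_obs:
            hits += 1
    return hits / n_perm
```

Random relabellings are drawn independently here, so the same split can occur more than once; enumerating all selections of n1 subjects is preferable when their number is small.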
Modified Rank Test
The Rank test is defined independently of the other genes contained in the microarray. Efron and Tibshirani considered two different hypotheses in connection with the problem of assessing the statistical significance of a pathway. The random null hypothesis states that the M genes in the pathway of interest have been chosen at random from the array. Hence, the null distribution of the test statistic is obtained by considering its value over all the possible sets of M genes in the array. On the other hand, to each subject corresponds an M-vector of expression values. The permutation hypothesis in that case states that the vectors are independent and identically distributed and hence, the distribution of the test statistic is obtained by permuting the vectors. As Efron and Tibshirani point out, both hypotheses have shortcomings. The first tends to ignore correlations among the genes whereas the second does not take into account the array from which the genes are drawn. Instead, they proposed an adjusted statistic which re-standardizes the observed statistic Sobs with mean m* and standard deviation σ* as follows:
where m*, σ* are the mean and standard deviation obtained by randomly selecting gene sets from the entire microarray and ms and σs are the mean and standard deviation obtained by permutation of the labels for the specific pathway.
Selection of Differentially Regulated Genes & Data Analysis
A method of objective gene selection was sought to avoid the reliance simply on a single arbitrary fold-change cut-off, which is known to be overly influenced by both small and large absolute expression levels. The chosen method includes (A) the determination of the upper X% of highest fold changes within narrow bins of absolute expression levels, (B) the rejection of very small absolute values, and (C) the subsequent ranking of genes by a combined fold change/absolute difference calculation.
(A) Selection of the upper X% of highest fold changes within binned absolute expression levels
The data from a typical Affymetrix experiment contains an average difference (Avg.Diff) value, which can be described as the difference in intensity between a perfect match oligonucleotide and a mismatch oligonucleotide. In order to clarify this parameter in terms of the present model, the term "absolute expression" will be used in place of "average difference". As usually indicated in the literature, both minimal and negative absolute expression values are set to a common number in order to eliminate genes with negative expression levels and to reject essentially uninterpretable information. Therefore, as a first-pass filter, absolute expression values of less than 20 were set to 20, and all genes which had a value of 20 across all four diets were immediately rejected. This process left 9391 genes in the liver out of the original 13179 genes represented on the Mu11K GeneChip. An additional parameter, highest fold change (HFC), was then applied to these remaining genes. HFC can be defined as:
HFC = MAX(A,B,C,D, etc.) / MIN(A,B,C,D, etc.) (eqn 1)
where A,B,C,D, etc. represent the individual microarray results for each gene
The proposed determination of HFC is highly influenced by absolute expression, and trends can readily be observed in our data set where HFC is negatively correlated with absolute expression. For example, it can be seen that with absolute expression values higher than 5000, it is unlikely to have HFC greater than 1.5, but with absolute expression values near 50, it is very easy to observe an HFC of ≥ 2. It should be noted that the present experiment comprises four diets or treatments; however, the HFC can be easily calculated for any number of experimental conditions. Furthermore, similar trends can be observed in numerous Affymetrix datasets we have examined (data not shown).
An ultimate goal was to develop a model that would account for absolute values when filtering genes on fold change. The selection of differentially expressed genes is essentially a search for outliers, i.e. gene data lying outside the normal distribution of differences relative to a control state, and which cannot be ascribed to chance or natural variability. In order to determine those genes which are outliers, it is necessary to either measure the variability of the system or to make valid assumptions regarding the normal distribution of variability. In the present model we assume that: (1) variability in gene expression measurements is related to the absolute expression level and (2) if a broad sampling of the transcriptome is measured, then only a small number of genes will actually be outliers, even in the harshest of experimental treatments. Assumption (1) is a fairly general analytical concept, i.e. the closer data are to the measurement threshold, the higher the variability in that measurement. Assumption (2) appears to be empirically valid when surveying the literature for high-density microarray experiments which evaluate severe biological events, from caloric restriction [10,11] to apoptosis [12,13]. In these experiments, through various selection techniques, it was found that less than 5% of the total number of genes probed were differentially regulated. Therefore, in order to develop the present model of gene selection, the validity of selecting outliers was evaluated for a range of highly variable genes, from 5% of the population on up.
The present model was developed by binning gene expression data into tight classes across the range of absolute expression values, i.e. 20-50, 50-100, 100-150, etc. and then selecting the upper 5% of HFC values for further consideration. Binning was carried out in such a manner as to ensure that there was never a bin containing zero genes or fewer genes than the preceding bin; therefore, bin sizes were not always equal. It is possible to search separately for the 5% of genes with the greatest HFCs in each class; however, in order to simplify the overall selection, we modeled the relationship between absolute expression, defined as the MIN(diets A,B,C,D) value, and HFC (eqn 1) in order to set a limit fold change (LFC). The relationship can be modeled using a simple equation of the form LFC = a+b/x (with a and b depending on the number of genes to be selected). Figure 1a demonstrates that as the selection criteria become stricter (top 5% → 3% → 1% of genes), the LFC curves change, yet converge at expression levels above 1000. The simple equation contains two parameters that have various repercussions on gene selection. Firstly, a sets the asymptote, which corresponds to the minimum highest fold change value that can be observed at any given absolute value. Secondly, b affects the LFC at a given absolute value, and is therefore highly influenced by this latter value. For example, the lower the absolute values the greater the LFC, and vice versa.
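One way to picture the binning and curve-fitting step is the Python sketch below. The text does not say how a and b were estimated, so several choices here are assumptions made purely for illustration: the per-bin cutoff is a nearest-rank percentile, each bin is represented by its midpoint, and the curve LFC = a + b/x is fitted by ordinary least squares on 1/x. The function names and the `(min_expression, hfc)` input layout are hypothetical.

```python
import math


def nearest_rank_percentile(ascending, q):
    """Nearest-rank percentile (q in 0..100) of an ascending list."""
    k = max(0, math.ceil(q / 100.0 * len(ascending)) - 1)
    return ascending[k]


def fit_lfc(genes, bin_edges, top_fraction=0.05):
    """Fit LFC = a + b / x through the per-bin HFC cutoffs.

    genes: list of (min_expression, hfc) pairs, one per gene.
    bin_edges: ascending bin boundaries, e.g. [20, 50, 100, 150].
    """
    points = []
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        hfcs = sorted(h for x, h in genes if lo <= x < hi)
        if not hfcs:
            continue  # skip empty bins (the text avoids them by construction)
        cutoff = nearest_rank_percentile(hfcs, 100.0 * (1.0 - top_fraction))
        midpoint = (lo + hi) / 2.0
        points.append((1.0 / midpoint, cutoff))

    if len(points) < 2:
        raise ValueError("need at least two non-empty bins to fit a and b")

    # Ordinary least squares of the cutoff on u = 1/x: cutoff = a + b * u.
    n = len(points)
    mean_u = sum(u for u, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    b = (sum((u - mean_u) * (y - mean_y) for u, y in points)
         / sum((u - mean_u) ** 2 for u, _ in points))
    a = mean_y - b * mean_u
    return a, b
```

Lowering `top_fraction` (for example to 0.03 or 0.01) raises the per-bin cutoffs and therefore yields a stricter curve.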
The relationship between absolute value, limit fold change (LFC), and variance across the absolute expression range. A) The various curves indicate the LFC required at different absolute values in order to be considered a significantly changed gene. As the selection criteria increases, the LFC increases, indicating that the 5% fold change model (green line) is more permissive than the 1% fold change model (red line). The various fold change models produced the curves with the following equations: A) in the liver: 5% LFC model = 1.52 + (100/absolute value) 3% LFC model = 1.55 + (140/absolute value) 1% LFC model = 1.70 + (185/absolute value). B) Examining the variance of each gene across the four dietary treatments enables the identification of those genes determined significantly changed. (•) represents genes below the 99.9% confidence level, () represents those genes selected by the 5% fold change model, and (+) represents those genes above the 99.9% confidence level. The various lines represent different confidence levels (i. 99.9%, ii. 99.999%, and iii. 99.99997%). As the fold change model increased (5% → 1%), concordance between the fold change model and the variance data (at a confidence level of 99.9%) increased (embedded table: x(y%), where x represents the number of genes with concordance (and y the percentage of genes with concordance)).
Using the equations in Figure 1a, the selection of genes for further consideration is then objective, simple, and global. A gene is selected with the HFC approach if MAX(A,B,C,D)/Min(A,B,C,D) > a+b/Min(A,B,C,D). After applying the 5% LFC gene filter, 489 genes remained in the list out of the 9391 genes potentially differentially expressed, selected from the original 13179 genes represented on the GeneChip. When interested in only the top 3% or 1% of significant genes, the total number of genes that meet the LFC requirements, and correspondingly the number of genes per bin, drops off rapidly (245 and 102 genes, respectively).
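The selection rule stated above can be written out as a short Python sketch. The floor of 20 and the liver 5% coefficients (a = 1.52, b = 100) are taken from the text and the Figure 1 caption; the function names and the list-of-values input are illustrative assumptions.

```python
FLOOR = 20.0                   # expression values below 20 are set to 20
A_5PCT, B_5PCT = 1.52, 100.0   # liver 5% LFC model: 1.52 + 100 / absolute value


def highest_fold_change(values, floor=FLOOR):
    """HFC = MAX / MIN over the conditions, after applying the floor."""
    floored = [max(v, floor) for v in values]
    return max(floored) / min(floored)


def passes_lfc(values, a=A_5PCT, b=B_5PCT, floor=FLOOR):
    """True if MAX/MIN > a + b/MIN for one gene measured across conditions."""
    floored = [max(v, floor) for v in values]
    lowest = min(floored)
    return max(floored) / lowest > a + b / lowest


# Usage: a lowly expressed gene needs a much larger fold change to pass.
low_gene = [50.0, 60.0, 70.0, 200.0]            # HFC 4.0, limit 1.52 + 100/50 = 3.52
high_gene = [5000.0, 5500.0, 6000.0, 7000.0]    # HFC 1.4, limit 1.52 + 100/5000 = 1.54
print(passes_lfc(low_gene), passes_lfc(high_gene))
```

A gene that sits at the floor in every condition has an HFC of 1 and can never pass, which matches the first-pass rejection described earlier.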
(B) The rejection of very small absolute values
Lastly, in an effort to objectively determine a minimum expression level cut-off, we examined the final distribution of absent and present calls (Absence Call) across gene bins in the remaining set of genes. It was determined that Affymetrix absence/presence calls would not be used a priori as a criterion for the selection of significantly regulated genes, but rather as a post-selection criterion. The absence call has been previously noted to be problematic, and has two potential drawbacks: 1) the assignment of an absence call is based on the ad hoc characterization of oligonucleotide matches and mismatches, the validity of which has been previously challenged, and 2) the call is not empirically reliable for individual genes, i.e. the confidence in the call is not high. However, it was expected that the distribution of absent calls across many genes at a range of absolute expression levels would not be random, and that the trend would be an important crosscheck for the confidence placed in changed genes at low expression levels.
As expected, the distribution of absent calls demonstrated that it was predominantly the very lowly expressed genes (95% of genes called absent, absolute expression ≤ 207) which were called absent across all four diets by the Affymetrix analysis software. This analysis also supports the idea that a threshold for an absolute minimum expression level could be developed empirically for each data set examined. In the present case, this would imply that any gene that did not reach a value of at least 207 in at least one experimental condition should be rejected, independent of the fold change measured. In practice, more than 95% of genes meeting these criteria would also be rejected on the basis that they were consistently marked absent across all experimental conditions. Therefore, such genes were eliminated in the last step of gene filtration. After removing these lowly expressed genes on the basis of these objective criteria, 329 genes remained in the list out of the original 13179 gene probe sets. The selected genes were considered to be potentially differentially regulated by our dietary treatments in the sense that these are the most highly differentially regulated genes within the context of the present experiment.
(C) Assignment of Gene Rank
Following overall gene selection, a rank of "importance" or "interest level", defined as Rank Number (RN), based on both the magnitude of fold change and absolute expression values, was assigned to each selected gene. The RN for each gene was determined by calculating a Rank Value (RV), which can be defined as: RV = HFC * (Max - Min). The RV is an abstract value that simply gives great importance to those genes that have a high fold change and simultaneously high differences in absolute expression values. After calculation of RV, gene lists were sorted and then assigned a simple rank of 1, 2, 3, 4 ... 329 in order of RV importance, where a gene with an RN of 1 corresponds to the gene with the highest RV. Both RV and RN are simply aids for the discussion of differential gene effects, which add the concept of relative weight or "importance" amongst selected genes. This concept then provides a further basis for the selection of genes for validation studies, as is detailed below.
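As a small illustration of the ranking step, the Python sketch below computes RV = HFC * (Max - Min) for each selected gene and assigns RN in descending order of RV. The gene names and expression values are made up for the example, and the dictionary input is an assumed layout.

```python
def rank_value(values):
    """RV = HFC * (Max - Min), with HFC = Max / Min across conditions."""
    highest, lowest = max(values), min(values)
    return (highest / lowest) * (highest - lowest)


def assign_rank_numbers(genes):
    """Map each gene to its Rank Number; RN 1 is the gene with the highest RV.

    genes: dict of gene name -> expression values across the conditions.
    """
    ordered = sorted(genes, key=lambda g: rank_value(genes[g]), reverse=True)
    return {gene: rn for rn, gene in enumerate(ordered, start=1)}


# Usage with hypothetical genes measured under four diets.
example = {
    "gene_a": [100.0, 200.0, 150.0, 120.0],      # RV = 2.0 * 100  = 200
    "gene_b": [1000.0, 3000.0, 2000.0, 1500.0],  # RV = 3.0 * 2000 = 6000
    "gene_c": [50.0, 60.0, 55.0, 52.0],          # RV = 1.2 * 10   = 12
}
print(assign_rank_numbers(example))
```

Because RV multiplies a ratio by an absolute difference, a highly expressed gene with a modest fold change can outrank a lowly expressed gene with a large one.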
(D) Model validation
Real-time polymerase chain reaction
The results obtained from a microarray experiment are influenced by each step in the experimental procedure, from array manufacturing to sample preparation and application to image analysis. The quality of the cDNA sample depends heavily on the efficiency of the reverse transcription step, where reagents and enzymes alike can influence the reaction outcome. All of these factors correspondingly affect the representation of transcripts in the final cDNA probe, which necessitates validation by complementary techniques. Analyses by northern blot and RNase protection assay are commonly reported in the literature; however, the emerging "gold-standard" validation technique is RT-PCR. As microarrays tend to have a low dynamic range, which leads to small but significant under-representations of fold changes in gene expression, RT-PCR, with its higher dynamic range, is used to validate the observed trends rather than to duplicate the absolute values obtained by chip experiments [17,16,18].
Having chosen genes that lie across the ranking system, RT-PCR was performed in triplicate for each experimental condition (Diet A, B, C, D) using the same pooled stocks of liver RNA (5 mice/experiment). Genes were compared to the endogenous controls β-actin and GAPDH, which were determined not to have significantly changed across the dietary treatments by both the LFC (microarray data) and a Student's t-test (RT-PCR). Subsequently, significant changes by RT-PCR were calculated by the Student's t-test with a predefined nominal α level of 0.05, where Diets B, C, and D were independently compared to the control diet A. The overall concordance of trends between the two techniques was 73% (e.g. an increase/decrease in gene expression seen by microarray was also seen by RT-PCR). For those genes whose results agreed between the two experiments, 68% of these results indicated larger fold changes by RT-PCR than those identified by array analysis. This concordance includes both genes determined as significantly changed as well as those genes determined not to have been significantly changed. When only those genes that were considered to be significantly changed by RT-PCR were examined, the concordance increased slightly to 80%.
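The notion of trend concordance used above can be illustrated with a short Python sketch: for each comparison against the control diet, check whether the microarray and RT-PCR fold changes point in the same direction, then report the fraction that agree. The fold-change values, the function names, and the optional `tolerance` band around 1.0 are hypothetical and do not reproduce the study's data or its exact scoring.

```python
def direction(fold_change, tolerance=0.0):
    """+1 for an increase, -1 for a decrease, 0 for no change.

    fold_change is treated / control; tolerance widens the no-change band.
    """
    if fold_change > 1.0 + tolerance:
        return 1
    if fold_change < 1.0 - tolerance:
        return -1
    return 0


def trend_concordance(array_fold_changes, pcr_fold_changes, tolerance=0.0):
    """Fraction of paired comparisons in which both techniques agree on direction."""
    pairs = list(zip(array_fold_changes, pcr_fold_changes))
    agree = sum(direction(a, tolerance) == direction(p, tolerance)
                for a, p in pairs)
    return agree / len(pairs)


# Usage with made-up fold changes for four gene/diet comparisons.
array_fc = [2.0, 0.5, 1.5, 0.8]
pcr_fc = [3.0, 0.4, 0.9, 0.7]
print(trend_concordance(array_fc, pcr_fc))
```

Agreement in direction says nothing about magnitude, which is consistent with the observation that RT-PCR often reports larger fold changes than the array.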
What is immediately noticeable through the color scheme (Table 1) is that genes with high RN (low RV) have little to no concordance between the two techniques where red indicates no concordance and blue indicates either one or two (out of three) of the results did not agree. When specifically examining fatty acid synthase (FAS), a highly expressed gene, one can quickly see that microarray fold changes of less than 2 can be corroborated between the two experimental techniques, reinforcing the strength of this fold change model.
As the selection criterion for the microarray data was that the HFC must be greater than the LFC model, the expectation is that the LFC trend line can be validated by RT-PCR. This is predominantly the case across the full dynamic range of data selected by the model, except for very lowly expressed genes such as the RAS oncogene. For genes with slightly lower RN (higher RV), such as ABCA1 and HSP5, some concordance is seen, indicating that confidence is increasing for these genes, and that as a group they can still be taken into account when looking for trends in gene expression. For genes with an RN lower than 176 (RV > 1156, e.g. USF-2), concordance quickly approaches 100%, indicating high confidence when discussing gene trends or individual gene results. These results in total reinforce the concept that RN is correlated with confidence / validity within the selected gene set resulting from the LFC model.
The genes discussed and validated in this report were identified using the 5% fold change model; however, the fold change percentage can be varied to meet both the researcher's and experiment's needs. It must be stressed that the 5% fold change model was chosen under the assumption that a relatively small percentage of genes will have their expression altered under any given condition. Therefore, selecting a fold change model of 5% may be either too permissive, where false positives are selected as differentially changed, or too restrictive, where true positives are not selected. Within the context of the present study, validation of the microarray results indicates that genes with low rank values are often more difficult to confirm by complementary techniques. Using the data obtained from RT-PCR, if one assumes that all genes with an RN below 176 (corresponding to RV > 1156) can be validated, then one would expect that these genes would be concentrated at higher expression levels. However, when the spread of those genes with a rank of 1 to 176 was examined, it was observed that these genes comprise a wide range of expression levels, indicating that the fold change model is objectively selecting differentially regulated genes across a wide range of absolute expression levels (data not shown), and that confidence in that selection increases with RV.
Variance Analysis with Real-time PCR
Variability is introduced into microarray data from two sources: biological variation (whether in vitro or in vivo) and measurement variation (hybridization, processing, scanning, etc.). In a brief effort to examine variability between individual mice, i.e. biological variability, RT-PCR measurements across control mice were examined using a subset of the genes examined by RT-PCR. Each gene was examined in triplicate in each of the five mice, and the variation in ΔCt (detection threshold) was determined. The Ct indicates the relative abundance of any particular gene and, when normalized to an endogenous control (β-actin and GAPDH), allows the relative amounts of a gene to be calculated. RT-PCR indicated, as did the microarray variance data, that lowly expressed genes have a higher variation, thereby hinting that biological and measurement variance are both influenced by absolute expression levels. The equation of the line was deemed significant (with p-values of 0.014 and 0.013 when normalized against β-actin and GAPDH, respectively). This again confirms the concept that highly expressed genes have little variance, and that small fold changes do represent a meaningful biological event.
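The normalization described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis: the Ct values are made up, the function names are hypothetical, and the conversion 2^(-ΔCt) assumes roughly 100% amplification efficiency.

```python
from statistics import mean, stdev

def delta_ct(ct_target, ct_control):
    """ΔCt: mean Ct of the target gene minus mean Ct of the endogenous control."""
    return mean(ct_target) - mean(ct_control)

def relative_abundance(d_ct):
    """Relative amount of the target, assuming ~100% amplification efficiency."""
    return 2.0 ** (-d_ct)

def delta_ct_spread(per_mouse_delta_ct):
    """Standard deviation of ΔCt across animals (biological variability)."""
    return stdev(per_mouse_delta_ct)

# Hypothetical triplicate Ct values for one gene in one mouse
ct_gene = [24.1, 24.3, 24.2]
ct_actin = [18.0, 18.2, 18.1]
d_ct = delta_ct(ct_gene, ct_actin)

# Hypothetical ΔCt values for the same gene in five control mice
spread = delta_ct_spread([6.1, 6.3, 5.9, 6.2, 6.0])
```

A larger ΔCt means a less abundant transcript, so comparing the spread of ΔCt between low- and high-abundance genes is one way to look at the trend reported in the text.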
Validation of the LFC model via characterization of measurement variability
The concept that variability and absolute expression are related has recently been examined by Coombes and colleagues; however, they examined only the variability of replicate spots on a single slide. This concept has now been further extended here to the examination of variability between genes on different microarrays. Measurement variance was examined following the development of the LFC model, and was therefore treated as a separate method for the confirmation of this model. To further understand the nature of measurement variability within the current study, duplicate Mu11K Affymetrix microarrays for the controls were examined. A pooled RNA sample from mice (n = 5) fed the control diet was hybridized to two different chips, and the data were analyzed in order to characterize measurement variability (data not shown). It was apparent from the trend that as absolute expression levels increase, the coefficient of variation (CV = SD/MAE, where MAE is the mean absolute expression) decreases. By overlaying the trendline of the variability data on those genes determined to be significantly regulated by the LFC model, the CV upper confidence level for these selected genes could be elucidated.
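The relationship between CV and expression level can be illustrated with a small sketch. It uses the plain definition CV = SD/mean and summarizes each expression bin by its median CV; this is a simple stand-in and not the authors' robust estimator. The bin edges and intensities are hypothetical.

```python
from statistics import mean, median, stdev

def coefficient_of_variation(values):
    """CV = SD / mean absolute expression for one gene."""
    return stdev(values) / mean(values)

def median_cv_by_bin(replicates, upper_edges):
    """Group genes by mean expression and return the median CV per bin.

    replicates  -- list of per-gene tuples of replicate intensities
    upper_edges -- ascending upper bounds of the expression bins (hypothetical)
    Genes above the last edge are ignored in this sketch.
    """
    bins = {edge: [] for edge in upper_edges}
    for values in replicates:
        level = mean(values)
        for edge in upper_edges:
            if level <= edge:
                bins[edge].append(coefficient_of_variation(values))
                break
    return {edge: median(cvs) for edge, cvs in bins.items() if cvs}

# Made-up duplicate intensities for four genes
duplicates = [(100, 140), (90, 130), (1000, 1040), (2000, 2100)]
cv_trend = median_cv_by_bin(duplicates, upper_edges=[500, 5000])
```

In this toy data the low-expression bin has the larger median CV, which is the qualitative trend the text describes.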
In order to estimate the CV without taking into account extreme values of the duplicate we used a robust estimator, represented by the following equation:
Where n = 2 and p = 0.5 (as the median CV of duplicate gene sample was used), the above equation enabled the CV to be determined by narrow bins of mean expression level, where extreme values are not accounted for.
The mean absolute expression of 13057 data points (genes) across the four diets were plotted against CV, and indicated a similar trend for the variability data where a high mean absolute expression results in a low CV (Figure 1b). Applying the CV derived from the duplicate sample data (eqn. 2) to the quadruplicate diet data enables the calculation of the CV upper confidence level (by bins of absolute expression level) using the following equation:
Where n = 4 and p = 0.001, 0.00001, 0.0000003, depending on the level of confidence desired (1-p).
Equation 3 allows us to identify those genes with a variance above the measurement variability. This greater variability arose due to combined pool (biological) and treatment variabilities.
This confidence level could then be raised or lowered, by altering p, according to the level of confidence desired; therefore, modeling the variance data provides an objective method for examining the variation of genes across the complete range of absolute expression values. The spread of the data indicates that most of the 13000 genes are both lowly expressed and highly variable across the four chips. A further examination of the data indicated that 95% of the genes determined to be 'absent' across all four diets by Affymetrix software had a mean absolute expression less than 207.
With the LFC model, genes were initially selected if they were in the top X% of the highest fold changes within their bin; however, the starting point (X%) was chosen solely on the basis of the percentage of genes shown to be differentially regulated across a wide range of published biological studies. The genes selected by the X% fold change model were then verified, with concordance results, by both RT-PCR and the variance data. Genes identified by the 5% fold change model were overlaid on the variance data corresponding to the four diets, and the confidence level for the X%-data selection was determined (Figure 1b). Concordance of 94.1%, 96.6% and 98.4% for the 5%, 3% and 1% fold change models, respectively, was observed with an upper confidence level selection of 99.9% (Figure 1b, inset table). In addition, overall concordance between microarray data and RT-PCR was examined in the different fold change models and indicated 73.3%, 81.5%, and 94.4% concordance for the 5%, 3%, and 1% fold change models, respectively (Figure 1a). The degree of concordance with RT-PCR results and the high confidence level (99.9%) obtained with the variance data reinforce that the X% fold change model is a simple, efficient, objective and statistically valid method for the identification of significantly differentiated genes.
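The initial selection step (top X% of fold changes within each expression bin) might look roughly like the sketch below. The equal-count binning, the rounding rule, and the function name are assumptions made for illustration; the text does not specify how the authors built their bins.

```python
import math

def select_top_fold_changes(genes, n_bins, percent):
    """Pick the top `percent` of fold changes inside each expression bin.

    genes -- list of (name, mean_expression, fold_change) tuples
    Bins hold equal numbers of genes after sorting by expression (an assumption).
    """
    ranked = sorted(genes, key=lambda g: g[1])
    bin_size = math.ceil(len(ranked) / n_bins)
    selected = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        keep = math.ceil(len(chunk) * percent / 100)
        top = sorted(chunk, key=lambda g: g[2], reverse=True)[:keep]
        selected.extend(name for name, _, _ in top)
    return selected

# Made-up genes: two lowly expressed, two highly expressed
toy_genes = [
    ("g1", 50, 3.0),
    ("g2", 60, 1.5),
    ("g3", 5000, 1.4),
    ("g4", 6000, 1.1),
]
picked = select_top_fold_changes(toy_genes, n_bins=2, percent=50)
```

Here g3 is selected with a fold change of 1.4 while g2 is rejected at 1.5, because each gene is judged against others at a similar expression level rather than against one global cut-off.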
FGT Part 5 - Design of Microarray Experiments
Averaging replicates gives better estimates of the mean. Replicates allow statistical inferences to be made.
Biological vs Technical Replication. Technical replicates come from the same sample on different chips. Biological replicates come from different samples. Replication is a scale between biological and technical.
3. Level of Inference
Always compromise between precision and generality
At what level do conclusions need to be made --> just the technical sample, all experiments in cell lines, all mice?
More general inferences capture more variance
More variability means more replicates
4. Statistical issues
a. Level of variability
statistically significant does not always mean biologically significant
b. Multiple testing and False Discovery Rate (FDR)
Usually a t-test is applied to each probeset. For each test, the p-value is the probability that the test would produce a result at least as extreme, assuming the null hypothesis is true. At a 0.05 threshold, each test with a true null hypothesis has a 5% chance of giving a false positive, so running many tests produces many false positives. FDR correction is applied to avoid a high number of false positives; it accounts for the number of tests applied.
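One widely used FDR procedure is Benjamini-Hochberg; the notes do not say which procedure was meant, so the sketch below is just one common choice. It converts raw p-values into adjusted values that can be compared with the desired FDR level.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values for four probesets
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adj]
```

With thousands of probesets the same function applies unchanged; only the factor m grows, which is how the correction accounts for the number of tests.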
c. Effect size
How large of a change we want to detect
Power: our ability to discover the truth. More replication means more power.
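The link between variability, effect size, power and replication can be made concrete with the standard normal-approximation formula for a two-group comparison, n = 2(z_{1-α/2} + z_{power})² σ²/δ² per group. This is a rough planning sketch, not a substitute for a proper power analysis; the function name and defaults are assumptions.

```python
from math import ceil
from statistics import NormalDist

def replicates_per_group(sd, effect, alpha=0.05, power=0.8):
    """Approximate replicates per group for a two-sided two-group comparison.

    sd     -- expected standard deviation (e.g. of log2 expression)
    effect -- smallest difference worth detecting, in the same units
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / effect) ** 2)

n_basic = replicates_per_group(sd=1.0, effect=1.0)
n_noisy = replicates_per_group(sd=2.0, effect=1.0)
n_strict = replicates_per_group(sd=1.0, effect=1.0, alpha=0.001)
```

Doubling the standard deviation roughly quadruples the replicates needed, and a stricter alpha (as required after multiple-testing correction) also raises the number.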
Common Design Principles
1. Single Factor
Varying a single factor at a time. Example: with or without drug. For dual-channel arrays, place the comparisons of interest near each other. A short time course can be treated as a single-factor experiment.
Microarray experiments with paired designs are often encountered in a clinical setting where, for example, samples are isolated from the same patients before and after treatment. Describe the reasons that it might be attractive to employ a paired design in a microarray experiment.
reduces variability in biological replicates
still captures variability with respect to response between patients
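A small numerical sketch shows why pairing helps. The expression values are invented: patients differ a lot from each other at baseline, but each responds to treatment by a similar amount. The unpaired statistic here is a Welch-style t, chosen only for illustration.

```python
from math import sqrt
from statistics import mean, variance

def paired_t(before, after):
    """t statistic computed on within-patient differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / sqrt(variance(diffs) / len(diffs))

def unpaired_t(group1, group2):
    """Welch-style t statistic that ignores the pairing."""
    se = sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (mean(group2) - mean(group1)) / se

# Hypothetical log2 expression of one gene in four patients
before = [5.0, 7.0, 9.0, 11.0]
after = [5.5, 7.4, 9.6, 11.5]

t_paired = paired_t(before, after)
t_unpaired = unpaired_t(before, after)
```

The paired statistic is large because between-patient baseline differences cancel out in the differences, while the unpaired statistic is swamped by them.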
-Pooling vs Amplification
Multiple isolations are pooled to give enough biological material
gives more robust estimation of the expression level
but it can be dominated by one unusual sample
pool only when necessary and consider amplification as an alternative
making sub-pools is a compromise, e.g. pool 15 into 3 x 5
amplification is an alternative to overcome limitations due to sample availability
but it is not possible to introduce amplification without bias
-Usually limited by cost and sample availability
-consider other experiments for informal estimation of parameters
-usually 3-5 replicates for a well-known strain
or 30-200 for human population inference
consider an extendable design or a pilot experiment
Comparing two conditions
A simple microarray experiment may be carried out to detect the differences in expression between two conditions. Each condition may be represented by one or more RNA samples. Using two-color cDNA microarrays, samples can be compared directly on the same microarray or indirectly by hybridizing each sample with a common reference sample [4, 6]. The null hypothesis being tested is that there is no difference in expression between the conditions. When conditions are compared directly, this implies that the true ratio between the expression of each gene in the two samples should be one. When samples are compared indirectly, the ratios between the test sample and the reference sample should not differ between the two conditions. It is often more convenient to use logarithms of the expression ratios than the ratios themselves because effects on intensity of microarray signals tend to be multiplicative; for example, doubling the amount of RNA should double the signal over a wide range of absolute intensities. The logarithm transformation converts these multiplicative effects (ratios) into additive effects (differences), which are easier to model; the log ratio when there is no difference between conditions should thus be zero. If a single-color expression assay is used - such as the Affymetrix system - we are again considering a null hypothesis of no expression-level difference between the two conditions, and the methods described in this article can also be applied directly to this type of experiment.
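The effect of the log transformation, and the equivalence of direct and indirect comparisons on the log scale, can be checked with a few lines of arithmetic. The intensities are made up and the function names are hypothetical.

```python
from math import log2

def log_ratio(sample, other):
    """log2 ratio of two intensities; zero means no difference."""
    return log2(sample / other)

def indirect_log_ratio(sample_a, sample_b, reference):
    """Compare A and B through a common reference: log(A/R) - log(B/R)."""
    return log_ratio(sample_a, reference) - log_ratio(sample_b, reference)

doubling = log_ratio(200, 100)    # two-fold up
halving = log_ratio(100, 200)     # two-fold down
no_change = log_ratio(150, 150)

direct = log_ratio(400, 100)
indirect = indirect_log_ratio(400, 100, reference=200)
```

Doubling and halving become symmetric values around zero, and the reference intensity cancels out of the indirect comparison.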
A distinction should be made between RNA samples obtained from independent biological sources - biological replicates - and those that represent repeated sampling of the same biological material - technical replicates. Ideally, each condition should be represented by multiple independent biological samples in order to conduct statistical tests. If only technical replicates are available, statistical testing is still possible but the scope of any conclusions drawn may be limited. If both technical and biological replicates are available, for example if the same biological samples are measured twice each using a dye-swap assay, the individual log ratios of the technical replicates can be averaged to yield a single measurement for each biological unit in the experiment. Callow et al. describe an example of a biologically replicated two-sample comparison, and our group provide an example with technical replication. More complicated settings that involve multiple layers of replication can be handled using the mixed-model analysis of variance techniques described below.
The simplest method for identifying differentially expressed genes is to evaluate the log ratio between two conditions (or the average of ratios when there are replicates) and consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed [10–12]. For example, if the cut-off value chosen is a two-fold difference, genes are taken to be differentially expressed if the expression under one condition is over two-fold greater or less than that under the other condition. This test, sometimes called 'fold' change, is not a statistical test, and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. The fold-change method is subject to bias if the data have not been properly normalized. For example, an excess of low-intensity genes may be identified as being differentially expressed because their fold-change values have a larger variance than the fold-change values of high-intensity genes [13, 14]. Intensity-specific thresholds have been proposed as a remedy for this problem.
The t test is a simple, statistically based method for detecting differentially expressed genes (see Box 2 for details of how it is calculated). In replicated experiments, the error variance (see Box 1) can be estimated for each gene from the log ratios, and a standard t test can be conducted for each gene; the resulting t statistic can be used to determine which genes are significantly differentially expressed (see below). This gene-specific t test is not affected by heterogeneity in variance across genes because it only uses information from one gene at a time. It may, however, have low power because the sample size - the number of RNA samples measured for each condition - is small. In addition, the variances estimated from each gene are not stable: for example, if the estimated variance for one gene is small, by chance, the t value can be large even when the corresponding fold change is small. It is possible to compute a global t test, using an estimate of error variance that is pooled across all genes, if it is assumed that the variance is homogeneous between different genes [16, 17]. This is effectively a fold-change test because the global t test ranks genes in an order that is the same as fold change; that is, it does not adjust for individual gene variability. It may therefore suffer from the same biases as a fold-change test if the error variance is not truly constant for all genes.
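The contrast between the gene-specific and global t statistics can be seen in a toy example. The log ratios are invented, and the pooled variance is taken as the simple mean of per-gene variances, which assumes equal replicate numbers; Box 2 of the source may define it differently.

```python
from math import sqrt
from statistics import mean, variance

def gene_specific_t(log_ratios):
    """One-sample t statistic using only this gene's variance."""
    return mean(log_ratios) / sqrt(variance(log_ratios) / len(log_ratios))

def global_t(all_log_ratios):
    """t statistics using one variance pooled across all genes.

    all_log_ratios -- dict mapping gene name to its replicate log ratios
    """
    pooled = mean(variance(v) for v in all_log_ratios.values())
    return {g: mean(v) / sqrt(pooled / len(v)) for g, v in all_log_ratios.items()}

# Made-up log2 ratios from three replicates
data = {
    "A": [0.30, 0.31, 0.29],   # small change, tiny variance by chance
    "B": [2.0, 0.5, 1.7],      # large change, noisy
    "C": [0.1, -0.2, 0.05],    # no real change
}
t_gene = {g: gene_specific_t(v) for g, v in data.items()}
t_global = global_t(data)
```

Gene A gets the largest gene-specific t despite its small fold change, whereas the global statistic orders the genes exactly as their mean log ratios do.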
Modifications of the t test
As noted above, the error variance (the square root of which gives the denominator of the t tests) is hard to estimate and subject to erratic fluctuations when sample sizes are small. More stable estimates can be obtained by combining data across all genes, but these are subject to bias when the assumption of homogeneous variance is violated. Modified versions of the t test (Box 2) find a middle ground that is both powerful and less subject to bias.
In the 'significance analysis of microarrays' (SAM) version of the t test (known as the S test), a small positive constant is added to the denominator of the gene-specific t test. With this modification, genes with small fold changes will not be selected as significant; this removes the problem of stability mentioned above. The regularized t test combines information from gene-specific and global average variance estimates by using a weighted average of the two as the denominator for a gene-specific t test. The B statistic proposed by Lonnstedt and Speed is a log posterior odds ratio of differential expression versus non-differential expression; it allows for gene-specific variances but it also combines information across many genes and thus should be more stable than the t statistic (see Box 2 for details).
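The effect of adding a small constant to the denominator can be sketched as follows. This only illustrates the idea behind the S test: the constant s0 is fixed by hand here, whereas the SAM software derives it from the data, and the log ratios are invented.

```python
from math import sqrt
from statistics import mean, variance

def standard_error(log_ratios):
    return sqrt(variance(log_ratios) / len(log_ratios))

def t_statistic(log_ratios):
    """Ordinary gene-specific t statistic."""
    return mean(log_ratios) / standard_error(log_ratios)

def s_statistic(log_ratios, s0):
    """S-test style statistic: s0 (hypothetical value) is added to the denominator."""
    return mean(log_ratios) / (standard_error(log_ratios) + s0)

small_but_stable = [0.30, 0.31, 0.29]   # small fold change, tiny variance
large_but_noisy = [2.0, 0.5, 1.7]       # large fold change, noisy

t_small = t_statistic(small_but_stable)
t_large = t_statistic(large_but_noisy)
s_small = s_statistic(small_but_stable, s0=0.2)
s_large = s_statistic(large_but_noisy, s0=0.2)
```

Without the constant the small-fold-change gene ranks first; with it, the ranking flips in favor of the gene with the larger change.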
The t and B tests based on log ratios can be found in the Statistics for Microarray Analysis (SMA) package; the S test is available in the SAM software package and the regularized t test is in the Cyber T package. In addition, the Bioconductor project has a collection of various analysis tools for microarray experiments. Additional modifications of the t test are discussed by Pan.
Graphical summaries (the 'volcano plot')
The 'volcano plot' is an effective and easy-to-interpret graph that summarizes both fold-change and t-test criteria (see Figure 1). It is a scatter-plot of the negative log10-transformed p-values from the gene-specific t test (calculated as described in the next section) against the log2 fold change (Figure 1a). Genes with statistically significant differential expression according to the gene-specific t test will lie above a horizontal threshold line. Genes with large fold-change values will lie outside a pair of vertical threshold lines. The significant genes identified by the S, B, and regularized t tests will tend to be located in the upper left or upper right parts of the plot.
Volcano plots. The negative log10-transformed p-values of the F1 test (see Box 3b) are plotted against (a) the log ratios (log2 fold change) in a two-sample experiment or (b) the standard deviations of the variety-by-gene VG values (see Box 3a) in a four-sample experiment. The horizontal bars in each plot represent the nominal significant level 0.001 for the F1 test under the assumption that each gene has a unique variance. The vertical bars represent the one-step family-wise corrected significance level 0.01 for the F3 test (see Box 3b) under the assumption of constant variance across all genes. Black points represent the significant genes selected by the F2 test with a compromise of these two variance assumptions.
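The coordinates of a volcano plot and the regions described above can be computed without any plotting library. In this sketch the p-value threshold of 0.001 follows the nominal level mentioned in the caption, while the fold-change threshold of 1 (two-fold on the log2 scale) and the region labels are assumptions.

```python
from math import log10

def volcano_point(log2_fold_change, p_value):
    """(x, y) position of a gene: log2 fold change vs -log10 p-value."""
    return (log2_fold_change, -log10(p_value))

def classify(log2_fold_change, p_value, fc_cutoff=1.0, p_cutoff=0.001):
    """Name the region of the plot a gene falls in."""
    x, y = volcano_point(log2_fold_change, p_value)
    significant = y > -log10(p_cutoff)   # above the horizontal line
    large = abs(x) > fc_cutoff           # outside the vertical lines
    if significant and large:
        return "upper right" if x > 0 else "upper left"
    if significant:
        return "significant only"
    if large:
        return "large fold change only"
    return "neither"

# Made-up genes: (log2 fold change, p-value)
regions = [
    classify(2.0, 1e-5),
    classify(-1.5, 1e-4),
    classify(0.2, 1e-6),
    classify(3.0, 0.2),
]
```

Passing the resulting (x, y) pairs to any scatter-plot routine would reproduce the layout described in the caption.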
Target Deconvolution vs Target Discovery
The phenotypic approach to drug discovery falls within the realm of target deconvolution, and involves exposing cells, isolated tissues, or animal models, to small molecules to determine whether a specific candidate molecule exerts the desired effect – which is observed by a change in phenotype. 3 Whilst numerous animal models can be used for the characterization of small molecules and small-scale drug screening approaches, use of mammalian cells is often favored due to their compatibility with high-throughput screening (HTS) and greater physiological relevance.
The phenotypic approach goes beyond individual proteins or nucleic acids and involves the study of entire signaling pathways. The drug’s effect is determined before the specific biological (drug) target that underlies the observed phenotypic response is identified.
Advantages and challenges of phenotypic drug discovery
The greatest advantage phenotypic approaches have over target-based is their ability to demonstrate the efficacy of a drug in the context of a cellular environment. The drug is acting on the target in its ‘normal’ biological context, rather than on a purified target in a biochemical screen.
Cost, availability of cells, complex assay methodology, and throughput are all potential challenges associated with cell-based phenotypic screens. However, as assays become miniaturized and the use of three-dimensional cell models (organoids and spheroids) continues to gain momentum, both scalability and physiological relevance have improved, leading to greater adoption of phenotypic approaches.
In addition, this resurgence in phenotypic screening has encouraged further major technological advances, including the development of iPS cell technologies, gene-editing tools, and detection and imaging assays, 5 which have again positively impacted this approach.
Advantages and challenges of target-based discovery
The fact that a drug candidate’s molecular mechanism is understood from the outset is a key advantage over phenotypic approaches. Target-based methods are also typically easier to carry out and less expensive to develop, and the process is generally faster. 6
Target-based drug discovery can exploit numerous approaches (including crystallography, computational modeling, genomics, biochemistry, and binding kinetics) to uncover exactly how a drug interacts with the target of interest, enabling: 6
- Development of the structure-activity relationship (SAR) (the relationship between the structure and biological activity of a molecule)
- Development of biomarkers
- Discovery of future therapeutics that act at the specific target of interest
| Technique | Drug discovery approach |
| Affinity chromatography | Target deconvolution |
| Protein microarray | Target deconvolution |
| Reverse transfected cell microarray | Target deconvolution |
| Biochemical suppression | Target deconvolution |
| siRNA | Target deconvolution / discovery / validation |
| DNA microarray | Target discovery |
| Systems biology | Target discovery |
| Study of existing drugs | Target discovery |
The analysis of microarray data poses considerable computational challenges. Academic and commercial software environments and applications have been and are being developed to meet these challenges. The commercial applications have primarily focused on user-friendliness, by providing fancy point-and-click graphical user interfaces. While this may be a desirable feature for some, it is unlikely to be a useful feature for research. What is important to research is for the software to be flexible and extensible so as to allow the user to determine the analysis method thought to be best suited to address the scientific questions at hand. To this end, we have found the R statistical environment 48 to be an ideal match. It should be emphasized that R is not a software application designed to facilitate a certain number of prespecified analyses thought to be useful or important by the software developers, but rather “an environment to conduct statistical analyses and computation.” By providing the requisite building blocks, including an object-oriented programming language and outstanding facilities to produce graphics, R puts the user in charge. These capabilities are complemented by extension packages contributed by other R users. Of special note is the Bioconductor project, 49 which provides a comprehensive library of extension packages specifically developed for the preprocessing, analysis, visualization, and annotation of molecular data. In addition to technical documentation, most Bioconductor packages offer vignettes, which serve as tutorials.
As an interpreted language, R may not be as fast as some compiled languages. It is possible to include C/C++ and FORTRAN code in R. It is also possible to call R from these languages to build stand-alone packages. Another powerful programming language used by the bioinformatics community is Python. R can be interfaced from Python through rpy and rpy2. R can be installed on laptops, desktops, and servers running a variety of operating systems including GNU/Linux, Windows, and MacOS. It is open-source and distributed under a public license.
Many statistical algorithms and procedures used to analyze microarray data are parallelizable. Packages that allow the user to parallelize code over clusters or multicore servers include snow, multicore, and Rmpi. Graphical Processing Units (GPUs) provide another hardware resource for conducting stream computing. Two extension packages that enable the use of GPUs within R are gputools 50 and permGPU. 51
An important principle in conducting genomic research is reproducibility. This applies not only to the scientific experiment, where technical or biological replicates are used to ascertain the reproducibility of the assay, but also to the quantitative component of the research. It should be noted that reproducibility is a necessary but not sufficient component of good research, as poor research can be conducted in a reproducible fashion. The R statistical environment greatly facilitates the conduct of reproducible research by providing a framework for literate programming 57 through Sweave 56 by combining LaTeX (http://www.latex-project.org) as the typesetting engine and R as the computational engine.
Venables and Ripley 54 and Dalgaard 55 provide extensive and accessible accounts on conducting programming and statistical analyses using R. Gentleman et al. 56 and Hahne et al. 57 provide accounts on conducting statistical analysis using Bioconductor extension packages. All statistical analyses presented in this paper were conducted using R.