
Examine the graph below

a. As temperature increases, what happens to the solubility of oxygen in water?

b. Dissolved oxygen in a particular stream varies from 8 to 13 mg/liter. What temperature range does this reflect?

Fish require oxygen to live. As water passes through gills, dissolved oxygen (DO) is transferred to blood. Dissolved oxygen in water is affected by:

- Photosynthesis: During light hours, aquatic plants produce oxygen.
- Mixing: Waves and waterfalls aerate water and increase oxygen concentration.
- Decomposition: As organic material decays, bacteria consume oxygen.
- Salinity: As water becomes more salty, its ability to hold oxygen decreases.

Examine the graph below:

a. At which temperature is dissolved oxygen highest?

b. A trout requires more DO when the water temperature is 24 °C (75 °F) than when the water temperature is 4 °C (to support an increase in metabolic activity). How do the DO levels compare at these two temperatures?

c. A power plant discharges warm water into a river. Why would this have a detrimental effect on fish that live in the river? Be specific in your answer and relate it to this lab.

High levels of CO_{2} have been implicated in global warming and as a causative factor in mass extinctions.

a. Construct a line graph of the 1970–2018 data **by decade**, beginning with 1970 and ending with 2010.

b. State a conclusion as to the change in CO_{2} levels over time.
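A graph like the one asked for in part (a) can be sketched in a few lines of Python with matplotlib. The values below are approximate Mauna Loa annual means and are stand-ins for the dataset supplied with the exercise; substitute your own numbers.

```python
# Sketch of part (a): a line graph of atmospheric CO2 by decade, 1970-2010.
# The ppm values are approximate Mauna Loa annual means used as placeholders
# for the dataset provided with the exercise -- substitute your own data.
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

decades = [1970, 1980, 1990, 2000, 2010]
co2_ppm = [325.7, 338.8, 354.4, 369.5, 389.9]  # approximate annual means (ppm)

plt.plot(decades, co2_ppm, marker="o")
plt.xlabel("Year")
plt.ylabel("Atmospheric CO2 (ppm)")
plt.title("Atmospheric CO2 by decade, 1970-2010")
plt.savefig("co2_by_decade.png")

# Part (b): the values rise in every decade, supporting the conclusion
# that CO2 levels have increased steadily over time.
rises = all(b > a for a, b in zip(co2_ppm, co2_ppm[1:]))
print(rises)
```

For part (b), the monotonic rise across every decade is the pattern the conclusion should describe.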

## Abstract

Single-cell RNA sequencing measures gene expression at unprecedented resolution and scale, enabling analyses of cellular phenotypes that were not possible before. In this context, graphs arise as a natural representation of the system, both gene-centric and cell-centric. However, many advances in machine learning on graphs have not yet been harnessed in models of single-cell data. Taking the inference of cell types or gene interactions as examples, graph representation learning has wide applicability to both cell and gene graphs. Recent advances in spatial molecular profiling additionally put graph learning in the focus of attention because spatial information maps naturally onto spatial graphs. We argue that graph embedding techniques have great potential for various applications across single-cell biology. Here, we discuss how graph representation learning maps to current models and concepts used in single-cell biology and formalise overlaps with developments in graph-based deep learning.

## 3.2 Base R plotting

The most basic function is plot. Figure 3.2 shows its output when it is used to plot data from an enzyme-linked immunosorbent assay (ELISA). The assay was used to quantify the activity of the enzyme deoxyribonuclease (DNase), which degrades DNA. The data are assembled in the R object DNase, which conveniently comes with base R. The object DNase is a dataframe whose columns are Run, the assay run; conc, the protein concentration that was used; and density, the measured optical density.

Figure 3.2: Plot of concentration vs. density for an ELISA assay of DNase.

This basic plot can be customized, for example by changing the plot symbol and axis labels using the parameters xlab, ylab, and pch (plot character), as shown in Figure 3.3. Information about the variables is stored in the object DNase, and we can access it with the attr function.

Figure 3.3: Same data as in Figure 3.2 but with better axis labels and a different plot symbol.

Annotating dataframe columns with “metadata” such as longer descriptions, physical units, provenance information, etc., seems like a useful feature. Is this way of storing such information, as in the DNase object, standardized or common across the R ecosystem? Are there other standardized or common ways for doing this?

There is no good or widely used infrastructure in regular R *data.frame*s for this, nor in the tidyverse (*data_frame*, *tibble*). But have a look at the *DataFrame* class in the Bioconductor package **S4Vectors**. Among other things it is used to annotate the rows and columns of a *SummarizedExperiment*.

Besides scatterplots, we can also use built-in functions to create histograms and boxplots (Figure 3.4).

Figure 3.4: Histogram of the density from the ELISA assay, and boxplots of these values stratified by the assay run. The boxes are ordered along the axis in lexicographical order because the runs were stored as text strings. We could use R’s type conversion functions to achieve numerical ordering.

Boxplots are convenient for showing multiple distributions next to each other in a compact space. We will see more about plotting multiple univariate distributions in Section 3.6.

The base R plotting functions are great for quick interactive exploration of data, but we soon run into their limitations if we want to create more sophisticated displays. We are going to use a visualization framework called the grammar of graphics, implemented in the package **ggplot2**, which enables step-by-step construction of high-quality graphics in a logical and elegant manner. First, let us introduce and load an example dataset.

## Data interpretation: Uncovering and explaining trends in the data

The analyzed data can then be interpreted and explained. In general, when scientists interpret data, they attempt to explain the patterns and trends uncovered through analysis, bringing all of their background knowledge, experience, and skills to bear on the question and relating their data to existing scientific ideas. Given the personal nature of the knowledge they draw upon, this step can be subjective, but that subjectivity is scrutinized through the peer review process (see our Peer Review in Science module). Based on the smoothed curves, Jones, Wigley, and Wright interpreted their data to show a long-term warming trend. They note that the three warmest years in the entire dataset are 1980, 1981, and 1983. They do not go further, however, to suggest possible causes for the temperature increase; they merely state that the results are "extremely interesting when viewed in the light of recent ideas of the causes of climate change."


### Different interpretations in the scientific community

The data presented in this study were widely accepted throughout the scientific community, in large part due to the authors' careful description of the data and their process of analysis. Through the 1980s, however, a few scientists remained skeptical of the interpretation of a warming trend.

In 1990, Richard Lindzen, a meteorologist at the Massachusetts Institute of Technology, published a paper expressing his concerns with the warming interpretation (Lindzen, 1990). Lindzen highlighted several issues that he believed weakened the arguments for global temperature increases. First, he argued that the data collection was inadequate, suggesting that the current network of data collection stations was not sufficient to correct for the uncertainty inherent in data with so much natural variability (consider how different the weather is in Antarctica and the Sahara Desert on any given day). Second, he argued that the data analysis was faulty, and that the substantial gaps in coverage, particularly over the ocean, raised questions regarding the ability of such a dataset to adequately represent the global system. Finally, Lindzen suggested that the interpretation of the global mean temperature data is inappropriate, and that there is no trend in the data. He noted a decrease in the mean temperature from 1940 to 1970 at a time when atmospheric CO_{2} levels, a proposed cause for the temperature increases, were increasing rapidly. In other words, Lindzen brought a different background and set of experiences and ideas to bear on the same dataset, and came to very different conclusions.

This type of disagreement is common in science, and generally leads to more data collection and research. In fact, the differences in interpretation over the presence or absence of a trend motivated climate scientists to extend the temperature record in both directions – going back further into the past and continuing forward with the establishment of dedicated weather stations around the world. In 1998, Michael Mann, Raymond Bradley, and Malcolm Hughes published a paper that greatly expanded the record originally cited by Jones, Wigley, and Wright (Mann, Bradley, & Hughes, 1998). Of course, they were not able to use air temperature readings from thermometers to extend the record back to 1000 CE; instead, the authors used data from other sources that could provide information about air temperature to reconstruct past climate, like tree ring width, ice core data, and coral growth records (Figure 4, blue line).

Figure 4: Differences between annual mean temperature and mean temperature during the reference period 1961-1990. The blue line represents data from tree ring, ice core, and coral growth records; the orange line represents data measured with modern instruments. Graph adapted from Mann et al., published in the IPCC Third Assessment Report. Image © IPCC

Mann, Bradley, and Hughes used many of the same analysis techniques as Jones and co-authors, such as applying a ten-year running average, and in addition, they included measurement uncertainty on their graph: the gray region shown on the graph in Figure 4. Reporting error and uncertainty for data does not imply that the measurements are wrong or faulty – in fact, just the opposite is true. The magnitude of the error describes how confident the scientists are in the accuracy of the data, so bigger reported errors indicate less confidence (see our Uncertainty, Error, and Confidence module). They note that the magnitude of the uncertainty increases going further back in time but becomes more tightly constrained around 1900.

In their interpretation, the authors describe several trends they see in the data: several warmer and colder periods throughout the record (for example, compare the data around year 1360 to 1460 in Figure 4), and a pronounced warming trend in the twentieth century. In fact, they note that "almost all years before the twentieth century [are] well below the twentieth-century mean," and these show a linear trend of decreasing temperature (Figure 4, pink dashed line). Interestingly, where Jones et al. reported that the three warmest years were all within the last decade of their record, the same is true for the much more extensive dataset: Mann et al. report that the warmest years in their dataset, which runs through 1998, were 1990, 1995, and 1997.

### Debate over data interpretation spurs further research

The debate over the interpretation of data related to climate change as well as the interest in the consequences of these changes have led to an enormous increase in the number of scientific research studies addressing climate change, and multiple lines of scientific evidence now support the conclusions initially made by Jones, Wigley, and Wright in the mid-1980s. All of these results are summarized in the Fourth Assessment Report (AR4) of the Intergovernmental Panel on Climate Change (IPCC), released to the public in 2007 (IPCC, 2007). Based on the agreement between these multiple datasets, the team of contributing scientists wrote:

Warming of the climate system is unequivocal, as is now evident from observations of increases in global average air and ocean temperatures, widespread melting of snow and ice, and rising global average sea level.

The short phrase "now evident" reflects the accumulation of data over time, including the most recent data up to 2007.

A higher level of data interpretation involves determining the reason for the temperature increases. The AR4 goes on to say:

Most of the observed increase in global average temperatures since the mid-20th century is very likely due to the observed increase in anthropogenic greenhouse gas concentrations.

This statement relies on many data sources in addition to the temperature data, including data as diverse as the timing of the first appearance of tree buds in spring, greenhouse gas concentrations in the atmosphere, and measurements of isotopes of oxygen and hydrogen from ice cores. Analyzing and interpreting such a diverse array of datasets requires the combined expertise of the many scientists that contributed to the IPCC report. This type of broad synthesis of data and interpretation is critical to the process of science, highlighting how individual scientists build on the work of others and potentially inspiring collaboration for further research between scientists in different disciplines.

Data interpretation is not a free-for-all, nor are all interpretations equally valid. Interpretation involves constructing a logical scientific argument that explains the data. Scientific interpretations are neither absolute truth nor personal opinion: They are inferences, suggestions, or hypotheses about what the data mean, based on a foundation of scientific knowledge and individual expertise. When scientists begin to interpret their data, they draw on their personal and collective knowledge, often talking over results with a colleague across the hall or on another continent. They use experience, logic, and parsimony to construct one or more plausible explanations for the data. As within any human endeavor, scientists can make mistakes or even intentionally deceive their peers (see our Scientific Ethics module), but the vast majority of scientists present interpretations that they feel are most reasonable and supported by the data.


## 3.4 Proteins

In this section, you will investigate the following questions:

- What are functions of proteins in cells and tissues?
- What is the relationship between amino acids and proteins?
- What are the four levels of protein organization?
- What is the relationship between protein shape and function?

### Connection for AP ® Courses

Proteins are long chains of different sequences of the 20 amino acids, each of which contains an amino group (-NH_{2}), a carboxyl group (-COOH), and a variable group. (Think of how many protein “words” can be made with 20 amino acid “letters.”) Each amino acid is linked to its neighbor by a peptide bond formed by a dehydration reaction. A long chain of amino acids is known as a polypeptide. Proteins serve many functions in cells. They act as enzymes that catalyze chemical reactions, provide structural support, regulate the passage of substances across the cell membrane, protect against disease, and coordinate cell signaling pathways. Protein structure is organized at four levels: primary, secondary, tertiary, and quaternary. The primary structure is the unique sequence of amino acids. A change in just one amino acid can change protein structure and function. For example, sickle cell anemia results from just one amino acid substitution in a hemoglobin molecule consisting of 574 amino acids. The secondary structure consists of the local folding of the polypeptide by hydrogen bond formation, leading to the α helix and β pleated sheet conformations. In the tertiary structure, various interactions, e.g., hydrogen bonds, ionic bonds, disulfide linkages, and hydrophobic interactions between R groups, contribute to the folding of the polypeptide into different three-dimensional configurations. Most enzymes are of tertiary configuration. If a protein is denatured, that is, if it loses its three-dimensional shape, it may no longer be functional. Environmental conditions such as temperature and pH can denature proteins. Some proteins, such as hemoglobin, are formed from several polypeptides, and the interactions of these subunits form the quaternary structure of proteins.

The information presented and the examples highlighted in this section support concepts and Learning Objectives outlined in Big Idea 4 of the AP ® Biology Curriculum Framework. The Learning Objectives listed in the Curriculum Framework provide a transparent foundation for the AP ® Biology course, an inquiry-based laboratory experience, instructional activities, and AP ® exam questions. A Learning Objective merges required content with one or more of the seven science practices.

Big Idea 4 | Biological systems interact, and these systems and their interactions possess complex properties. |
---|---|
Enduring Understanding 4.A | Interactions within biological systems lead to complex properties. |
Essential Knowledge 4.A.1 | The subcomponents of biological molecules and their sequence determine the properties of that molecule. |
Science Practice 7.1 | The student can connect phenomena and models across spatial and temporal scales. |
Learning Objective 4.1 | The student is able to explain the connection between the sequence and the subcomponents of a biological polymer and its properties. |
Essential Knowledge 4.A.1 | The subcomponents of biological molecules and their sequence determine the properties of that molecule. |
Science Practice 1.3 | The student can refine representations and models of natural or man-made phenomena and systems in the domain. |
Learning Objective 4.2 | The student is able to refine representations and models to explain how the subcomponents of a biological polymer and their sequence determine the properties of that polymer. |
Essential Knowledge 4.A.1 | The subcomponents of biological molecules and their sequence determine the properties of that molecule. |
Science Practice 6.1 | The student can justify claims with evidence. |
Science Practice 6.4 | The student can make claims and predictions about natural phenomena based on scientific theories and models. |
Learning Objective 4.3 | The student is able to use models to predict and justify that changes in the subcomponents of a biological polymer affect the functionality of the molecules. |

### Teacher Support

Twenty amino acids can be formed into a nearly limitless number of different proteins. The sequence of the amino acids ultimately determines the final configuration of the protein chain, giving the molecule its specific function.

### Teacher Support

Emphasize that proteins have a variety of functions in the body. Table 3.1 contains some examples of these functions. Note that not all enzymes work under the same conditions. Amylase only works in an alkaline medium, such as saliva, while pepsin works in the acid environment of the stomach. Discuss other materials that can be carried by proteins in body fluids in addition to the substances listed for transport in the text. Proteins also carry insoluble lipids in the body and transport charged ions, such as calcium, magnesium, and zinc. Discuss another important structural protein, collagen, as it is found throughout the body, including in most connective tissues. Emphasize that not all hormones are proteins and that steroid-based hormones were discussed in the previous section.

The amino group of an amino acid readily gains a proton, becoming positively charged, while the carboxyl group readily loses a proton, becoming negatively charged. This zwitterionic character of amino acids gives the compounds solubility in water. The presence of both functional groups also allows dehydration synthesis to join the individual amino acids into a peptide chain.

Protein structure is explained as though it occurs in three to four discrete steps. In reality, the structural changes that result in a functional protein occur on a continuum. As the primary structure is formed off the ribosomes, the polypeptide chain goes through changes until the final configuration is achieved. Have the students imagine a strand of spaghetti as it cooks in a clear pot. Initially, the strand is straight (ignore the stiffness for this example). While it cooks, the strand will bend and twist and (again, for this example), fold itself into a loose ball made up of the strand of pasta. The resulting strand has a particular shape. Ask the students what types of chemical bonds or forces might affect protein structure. These shapes are dictated by the position of amino acids along the strand. Other forces will complete the folding and maintain the structure.

The Science Practice Challenge Questions contain additional test questions for this section that will help you prepare for the AP exam. These questions address the following standards:

[APLO 1.14] [APLO 2.12] [APLO 4.1] [APLO 4.3] [APLO 4.15] [APLO 4.22]

### Types and Functions of Proteins

Proteins are one of the most abundant organic molecules in living systems and have the most diverse range of functions of all macromolecules. Proteins may be structural, regulatory, contractile, or protective; they may serve in transport, storage, or membranes; or they may be toxins or enzymes. Each cell in a living system may contain thousands of proteins, each with a unique function. Their structures, like their functions, vary greatly. They are all, however, polymers of amino acids, arranged in a linear sequence.

Enzymes, which are produced by living cells, are catalysts in biochemical reactions (like digestion) and are usually complex or conjugated proteins. Each enzyme is specific for the substrate (a reactant that binds to an enzyme) it acts on. The enzyme may help in breakdown, rearrangement, or synthesis reactions. Enzymes that break down their substrates are called catabolic enzymes; enzymes that build more complex molecules from their substrates are called anabolic enzymes; and enzymes that affect the rate of reaction are called catalytic enzymes. It should be noted that all enzymes increase the rate of reaction and, therefore, are considered to be organic catalysts. An example of an enzyme is salivary amylase, which hydrolyzes its substrate amylose, a component of starch.

Hormones are chemical-signaling molecules, usually small proteins or steroids, secreted by endocrine cells that act to control or regulate specific physiological processes, including growth, development, metabolism, and reproduction. For example, insulin is a protein hormone that helps to regulate the blood glucose level. The primary types and functions of proteins are listed in Table 3.1.

Type | Examples | Functions |
---|---|---|
Digestive Enzymes | Amylase, lipase, pepsin, trypsin | Help in digestion of food by catabolizing nutrients into monomeric units |
Transport | Hemoglobin, albumin | Carry substances in the blood or lymph throughout the body |
Structural | Actin, tubulin, keratin | Construct different structures, like the cytoskeleton |
Hormones | Insulin, thyroxine | Coordinate the activity of different body systems |
Defense | Immunoglobulins | Protect the body from foreign pathogens |
Contractile | Actin, myosin | Effect muscle contraction |
Storage | Legume storage proteins, egg white (albumin) | Provide nourishment in early development of the embryo and the seedling |

Proteins have different shapes and molecular weights; some proteins are globular in shape whereas others are fibrous in nature. For example, hemoglobin is a globular protein, but collagen, found in our skin, is a fibrous protein. Protein shape is critical to its function, and this shape is maintained by many different types of chemical bonds. Changes in temperature, pH, and exposure to chemicals may lead to permanent changes in the shape of the protein, leading to loss of function, known as denaturation. All proteins are made up of different arrangements of the most common 20 types of amino acids.

### Amino Acids

Amino acids are the monomers that make up proteins. Each amino acid has the same fundamental structure, which consists of a central carbon atom, also known as the alpha (*α*) carbon, bonded to an amino group (NH_{2}), a carboxyl group (COOH), and to a hydrogen atom. Every amino acid also has another atom or group of atoms bonded to the central atom known as the R group (Figure 3.24).

The name "amino acid" is derived from the fact that they contain both an amino group and a carboxyl (acid) group in their basic structure. As mentioned, there are 20 common amino acids present in proteins. Nine of these are considered essential amino acids in humans because the human body cannot produce them and they must be obtained from the diet. For each amino acid, the R group (or side chain) is different (Figure 3.25).

### Visual Connection

- Polar and charged amino acids will be found on the surface. Non-polar amino acids will be found in the interior.
- Polar and charged amino acids will be found in the interior. Non-polar amino acids will be found on the surface.
- Non-polar and uncharged proteins will be found on the surface as well as in the interior.

The chemical nature of the side chain determines the nature of the amino acid (that is, whether it is acidic, basic, polar, or nonpolar). For example, the amino acid glycine has a hydrogen atom as the R group. Amino acids such as valine, methionine, and alanine are nonpolar or hydrophobic in nature, while amino acids such as serine, threonine, and cysteine are polar and have hydrophilic side chains. The side chains of lysine and arginine are positively charged, and therefore these amino acids are also known as basic amino acids. Proline has an R group that is linked to the amino group, forming a ring-like structure. Proline is an exception to the standard structure of an amino acid since its amino group is not separate from the side chain (Figure 3.25).

Amino acids are represented by a single uppercase letter or a three-letter abbreviation. For example, valine is known by the letter V or the three-letter symbol Val. Just as some fatty acids are essential to a diet, some amino acids are necessary as well. They are known as essential amino acids, and in humans they include isoleucine, leucine, and lysine, among others. Essential amino acids are those necessary for construction of proteins in the body but not produced by the body; which amino acids are essential varies from organism to organism.

The sequence and the number of amino acids ultimately determine the protein's shape, size, and function. Each amino acid is attached to another amino acid by a covalent bond, known as a peptide bond , which is formed by a dehydration reaction. The carboxyl group of one amino acid and the amino group of the incoming amino acid combine, releasing a molecule of water. The resulting bond is the peptide bond (Figure 3.26).

The products formed by such linkages are called peptides. As more amino acids join to this growing chain, the resulting chain is known as a polypeptide. Each polypeptide has a free amino group at one end. This end is called the N terminal, or the amino terminal, and the other end has a free carboxyl group, also known as the C or carboxyl terminal. While the terms polypeptide and protein are sometimes used interchangeably, a polypeptide is technically a polymer of amino acids, whereas the term protein is used for a polypeptide or polypeptides that have combined together, often have bound non-peptide prosthetic groups, have a distinct shape, and have a unique function. After protein synthesis (translation), most proteins are modified. These are known as post-translational modifications. They may undergo cleavage or phosphorylation, or may require the addition of other chemical groups. Only after these modifications is the protein completely functional.

### Link to Learning

Click through the steps of protein synthesis in this interactive tutorial.

## 5. Relative differences, ratios, and correlations

### 5.1. Comparing relative versus incremental differences

It is common in biology for relative changes to be more germane than incremental ones. There are two principal reasons for this. One is that certain biological phenomena can only be properly described and understood through relative changes. For example, if we were to count the number of bacterial cells in a specified volume of liquid culture every hour, we might derive the following numbers: 1,000, 2,000, 4,000, 8,000, 16,000. The pattern is clear: the cells are doubling every hour. Conversely, it would be ridiculous to take the mean of the observed changes in cell number and to state that, on average, the cells increase by 3,750 each hour with a 95% CI of −1,174.35 to 8,674.35! The second reason is due to experimental design. There are many instances where variability between experiments or specimens makes it difficult, if not impossible, to pool mean values from independent repeats in a productive way. Rather, the ratio of experimental and control values within individual experiments or specimens should be our focus. One example of this involves quantifying bands on a western blot and is addressed below. Most traditional statistical approaches, however, are oriented toward the analysis of incremental changes (i.e., where change is measured by subtraction). Thus, it may not always be clear how to analyze data when the important effects are relative.
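The bacterial-count example above can be made concrete in a few lines of Python: the incremental changes give a technically correct but biologically misleading mean, while the hour-to-hour ratios expose the real pattern.

```python
# The bacterial counts from the text, analyzed two ways.
# Relative changes reveal the doubling; the mean of the incremental
# changes (3,750 cells/hour) obscures it.
counts = [1000, 2000, 4000, 8000, 16000]

increments = [b - a for a, b in zip(counts, counts[1:])]  # 1000, 2000, 4000, 8000
ratios = [b / a for a, b in zip(counts, counts[1:])]      # 2.0 each hour

mean_increment = sum(increments) / len(increments)
print(mean_increment)  # 3750.0 -- a valid mean, but biologically misleading
print(ratios)          # [2.0, 2.0, 2.0, 2.0] -- the real pattern: doubling
```

Equivalently, the constant ratio means the counts are linear on a log scale, which is why growth data are usually plotted or analyzed after log transformation.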

As an example of a situation in which ratios are likely to be most useful, we consider the analysis of a western blot (Figure 13) (also see Gassmann et al., 2009). This schematic blot shows the outcome of an experiment designed to test the hypothesis that loss of gene *y* activity leads to changes in the expression of protein X in *C. elegans* . In one scenario, the three blots (A–C) could represent independent biological repeats with lanes 1 serving as technical (e.g., loading) repeats. In another scenario, the three blots could serve as technical repeats with lanes 1 representing independent biological repeats. Regardless, either scenario will give essentially the same result. Quantification of each band, based on pixel intensity, is indicated in blue 46 . Based on Figure 13, it seems clear that loss of gene *y* leads to an increase in the amount of protein X. So does the statistical analysis agree?

**Figure 13. Representative western blot analysis.**

Figure 14 shows the results of carrying out the statistical analysis in several different ways, including, for illustrative purposes, seven distinct two-tailed *t*-tests. In *t*-tests 1–3, data from wild-type and *mut y* bands were pooled within individual blots to obtain an average. Interestingly, only one of the three, blot A, showed a statistically significant difference (*P* ≤ 0.05) between wild type and *mut y*, despite all three blots appearing to give the same general result. In this case, the failure of blots B and C to show a significant difference is due to slightly more variability between samples of the same kind (i.e., wild type or *mut y*) and because, with *n* = 3, the power of the *t*-test to detect small or even moderate differences is weak. The situation is even worse when we combine subsets of bands from different blots, such as pooled lanes 1, 2, and 3 (*t*-tests 4–6). Pooling all of the wild-type and *mut y* data (*t*-test 7) does, however, lead to a significant difference (*P* = 0.0065).
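A within-blot comparison of this kind can be sketched with a pooled two-sample *t* statistic in standard-library Python. The band intensities below are hypothetical stand-ins, not the actual values from Figure 13; the statistic is compared against 2.776, the standard two-tailed critical value for df = 4 at α = 0.05.

```python
# Sketch of a within-blot comparison like t-tests 1-3, stdlib only.
# Band intensities are hypothetical, NOT the Figure 13 values.
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample t statistic with a pooled variance estimate."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / nx + 1 / ny)) ** 0.5

wild_type = [1.0, 1.1, 0.9]  # hypothetical intensities, one blot
mut_y     = [3.2, 3.5, 2.9]  # hypothetical intensities, same blot

t = pooled_t(mut_y, wild_type)
T_CRIT = 2.776  # two-tailed critical value, df = 4, alpha = 0.05
print(abs(t) > T_CRIT)  # True: significant for these made-up numbers
```

With real band data and *n* = 3 per group, the same computation often fails to clear the critical value, which is exactly the low-power problem the text describes for blots B and C.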

**Figure 14. Statistical analysis of western blot data from Figure 13.** A summary of test options is shown.

So was the *t*-test the right way to go? Admittedly, it probably wasn't very satisfying that only one of the first three *t*-tests indicated a significant difference, despite the raw data looking similar for all three. In addition, we need to consider whether or not pooling data from different blots was even kosher. On the one hand, the intensity of the protein X band in any given lane is influenced by the concentration of protein X in the lysate, which is something that we care about. On the other hand, the observed band intensity is also a byproduct of the volume of lysate loaded, the efficiency of protein transfer to the membrane, the activity of the radiolabel or enzymes used to facilitate visualization, and the length of the exposure time, none of which are relevant to our central question! In fact, in the case of western blots, comparing intensities across different blots is really an apples and oranges proposition, and thus pooling such data violates basic principles of logic 47. Thus, even though pooling all the data (*t*-test 7) gave us a sufficiently low *P*-value to satisfy us that there is a statistically significant difference in the numbers we entered, the premise for combining such data was flawed scientifically.

The last test shown in Figure 14 is the output from confidence interval calculations for two ratios. This test was carried out using an Excel tool that is included in this chapter 48 . To use this tool, we must enter for each paired experiment the mean (termed “estimate”) and the SE (“SE of est”) and must also choose a confidence level (Figure 15). Looking at the results of the statistical analysis of ratios (Figure 14), we generally observe much crisper results than were provided by the *t* -tests. For example, in the three cases where comparisons were made only within individual blots, all three showed significant differences corresponding to *P* < 0.05 and two (blots A and B) were significant to *P* < 0.01 49 . In contrast, as would be expected, combining lane data between different blots to obtain ratios did not yield significant results, even though the ratios were of a similar magnitude to the blot-specific data. Furthermore, although combining all values to obtain means for the ratios did give *P* < 0.05, it was not significant at the α level of 0.01. In any case, we can conclude that this statistical method for acquiring confidence intervals for ratios is a clean and powerful way to handle this kind of analysis.

**Figure 15. Confidence interval calculator for a ratio.**

It is also worth pointing out that there is another way in which the *t* -test could be used for this analysis. Namely, we could take the ratios from the first three blots (3.33, 3.41, and 2.48), which average to 3.07, and carry out a one-sample two-tailed *t* -test. Because the null hypothesis is that there is no difference in the expression of protein X between wild-type and *mut y* backgrounds, we would use an expected ratio of 1 for the test. Thus, the *P* -value will tell us the probability of obtaining a ratio of 3.07 if the expected ratio is really one. Using the above data points, we do in fact obtain *P* = 0.02, which would pass our significance cutoff. In fact, this is a perfectly reasonable use of the *t* -test, even though the test is now being carried out on ratios rather than the unprocessed data. Note, however, that changing the numbers only slightly to 3.33, 4.51, and 2.48, we would get a mean of 3.44 but with a corresponding *P* -value of 0.054. This again points out the problem with *t* -tests when one has very small sample sizes and moderate variation within samples.
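The one-sample test just described is easy to verify by hand; the sketch below reproduces it in Python (not part of the original analysis; the closed-form *P*-value shortcut is valid only for the two degrees of freedom we have here):

```python
import math
from statistics import mean, stdev

ratios = [3.33, 3.41, 2.48]  # mut y / wild-type ratios from blots A-C
n = len(ratios)

# One-sample two-tailed t-test against the null-hypothesis ratio of 1
t = (mean(ratios) - 1) / (stdev(ratios) / math.sqrt(n))

# For df = n - 1 = 2, the t distribution has a closed-form tail probability:
# two-tailed P = 1 - t / sqrt(t^2 + 2)
p = 1 - t / math.sqrt(t**2 + 2)
print(round(t, 2), round(p, 3))  # 6.97 0.02
```

Substituting the perturbed values (3.33, 4.51, 2.48) into `ratios` reproduces the *P* = 0.054 result as well.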

### 5.2. Ratio of means versus mean of ratios

There is also an important point to be made with respect to ratios that concerns the mean value that you would report. Based on the above paragraph, the mean of the three ratios is 3.07. However, looking at *t* -test 7, which used the pooled data, we can see that the ratio calculated from the total means would be 477/167 = 2.86. This points out a rather confounding property of ratio arithmetic. Namely, that the mean of the ratios (MoR in this case 3.07) is usually not equal to the ratio of the means (RoM 2.86). Which of the two you choose to use to report will depend on the question you are trying to answer.

To use a non-scientific (but intuitive) example, we can consider changes in housing prices over time. In a given town, the current appraised value of each house is compared to its value 20 years prior. Some houses may have doubled in value, whereas others may have quadrupled (Table 5). Taking an average of the ratios for individual houses (i.e., the relative increase)—the MoR approach—allows us to determine that the mean increase in value has been 3-fold. However, it turns out that cheaper houses (those initially costing ≤$100,000) have actually gone up about 4-fold on average, whereas more-expensive homes (those initially valued at ≥$300,000) have generally only doubled. Thus, the total increase in the combined value of all the homes in the neighborhood has not tripled but is perhaps 2.5-fold higher than it was 20 years ago (the RoM approach).

Table 5. A tinker-toy illustration for increases in house prices in TinyTown (which has only two households).

| | Before | After | Relative Increase |
|---|---|---|---|
| House 1 | $100,000 | $400,000 | 4 |
| House 2 | $300,000 | $600,000 | 2 |
| Means | $200,000 | $500,000 | 3 (MoR) |

RoM = $500,000 / $200,000 = 2.5

Which statistic is more relevant? Well, if you're the mayor and if property taxes are based on the appraised value of homes, your total intake will be only 2.5 times greater than it was 20 years ago. If, on the other hand, you are writing a newspaper article and want to convey the extent to which average housing prices have increased over the past 20 years, 3-fold would seem to be a more salient statistic. In other words, MoR tells us about the average effect on individuals, whereas RoM conveys the overall effect on the population as a whole. In the case of the western blot data, 3.07 (i.e., the MoR) is clearly the better indicator, especially given the stated issues with combining data from different blots. Importantly, it is critical to be aware of the difference between RoM and MoR calculations and to report the statistic that is most relevant to your question of interest.
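The TinyTown numbers make the MoR/RoM distinction easy to verify; a minimal sketch:

```python
before = [100_000, 300_000]  # appraised values 20 years ago
after = [400_000, 600_000]   # current appraised values

# Mean of ratios (MoR): average the per-house increases -> effect on individuals
mor = sum(a / b for a, b in zip(after, before)) / len(before)

# Ratio of means (RoM): total value now over total value then -> effect on the population
rom = sum(after) / sum(before)

print(mor, rom)  # 3.0 2.5
```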

### 5.3. Log scales

Data from studies where relative or exponential changes are pervasive may also benefit from transformation to log scales. For example, transforming to a log scale is the standard way to obtain a straight line from data that change exponentially. This can make for a more straightforward presentation and can also simplify the statistical analysis (see Section 6.4 on outliers). Thus, transforming 1, 10, 100, 1,000 into log_{10} gives us 0, 1, 2, 3. Which log base you choose doesn't particularly matter, although ten and two are quite intuitive, and therefore popular. The natural log 50 (base *e* ≈ 2.718), however, has historical precedent within certain fields and may be considered standard. In some cases, back transformation (from log scale to linear) can be done after the statistical analysis to make the findings clearer to readers.
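As a quick illustration of the transformation (powers of ten chosen so the transformed values come out exact):

```python
import math

data = [1, 10, 100, 1_000]       # exponentially spaced raw values
logged = [math.log10(x) for x in data]  # equally spaced on a log10 scale
print(logged)  # [0.0, 1.0, 2.0, 3.0]
```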

### 5.4. Correlation and modeling

For some areas of research (such as ecology, field biology, and psychology), modeling, along with its associated statistics, is a predominant form of analysis. This is not the case for most research conducted in the worm field or, for that matter, by most developmental geneticists or molecular biologists. Correlation, in contrast, can be an important and useful concept. For this reason, we include a substantive, although brief, section on correlation, and a practically non-existent section on modeling.

Correlation describes the co-variation of two variables. For example, imagine that we have a worm strain that expresses reporters for two different genes, one labeled with GFP, the other with mCherry 51 . To see if expression of the two genes is correlated, GFP and mCherry are measured in 50 individual worms, and the data are plotted onto a graph known as a *scatterplot* . Here, each worm is represented by a single dot with associated GFP and mCherry values corresponding to the *x* and *y* axes, respectively (Figure 16). In the case of a positive correlation, the cloud of dots will trend up to the right. If there is a negative correlation, the dots will trend down to the right. If there is little or no correlation, the dots will generally show no obvious pattern. Moreover, the closer the dots come to forming a unified tight line, the stronger the correlation between the variables. Based on Figure 16, it would appear that there is a positive correlation, even if the dots don't fall exactly on a single line. Importantly, it matters not which of the two values (GFP or mCherry) is plotted on the *x* or the *y* axes. The results, including the statistical analysis described below, will come out exactly the same.

**Figure 16. Scatterplot of GFP expression versus mCherry.** The correlation coefficient is 0.68. The units on the axes are arbitrary.

The extent of correlation between two variables can be quantified through calculation of a statistical parameter termed the *correlation coefficient* (a.k.a. *Pearson's product moment correlation coefficient*, *Pearson's r*, or just *r*). The formula is a bit messy and the details are not essential for interpretation. The value of *r* can range from −1 (a perfect negative correlation) to 1 (a perfect positive correlation), or can be 0 52 in the case of no correlation. Thus, depending on the tightness of the correlation, values will range from close to zero (weak or no correlation) to 1 or −1 (perfect correlation). In our example, if one of the two genes encodes a transcriptional activator of the other gene, we would expect to see a positive correlation. In contrast, if one of the two genes encodes a repressor, we should observe a negative correlation. If expression of the two genes is in no way connected, *r* should be close to zero, although random chance would likely result in *r* having either a small positive or negative value. Even in cases where a strong correlation is observed, however, it is important not to make the common mistake of equating correlation with causation 53 .

Like other statistical parameters, the SD, SE, and 95% CI can be calculated for *r*. In addition, a *P*-value associated with a given *r* can be determined, which answers the following question: What is the probability that random chance resulted in a correlation coefficient as far from zero as the one observed? As with other statistical tests, larger sample sizes will better detect small correlations that are statistically significant. Nevertheless, it is important to look beyond the *P*-value in assessing biological significance, especially if *r* is quite small. The validity of these calculations also requires many of the same assumptions described for other parametric tests, including the one that the data have something close to a normal distribution. Furthermore, it is essential that the two parameters are measured separately and that the value for a given *x* is not somehow calculated using the value of *y* and vice versa.

Examples of six different scatterplots with corresponding *r* and *P* -values are shown in Figure 17. In addition to these values, a black line cutting through the swarm of red dots was inserted to indicate the slope. This line was determined using a calculation known as the *least squares* or *linear least squares method* . The basic idea of this method is to find a straight line that best represents the trend indicated by the data, such that a roughly equal proportion of data points is observed above and below the line. Finally, blue dashed lines indicate the 95% CI of the slope. This means that we can be 95% certain that the true slope (for the population) resides somewhere between these boundaries.

**Figure 17. Fits and misfits of regression lines.** The units on the axes are arbitrary.

Panels A–C of Figure 17 show examples of a strong ( *r* = 0.86), weak ( *r* = 0.27), and nonexistent ( *r* = 0.005) correlation, respectively. The purple line in panel B demonstrates that a slope of zero can be fit within the 95% CI, which is consistent with the observed *P* -value of 0.092. Panel D illustrates that although small-sized samples can give the impression of a strong correlation, the *P* -value may be underwhelming because chance sampling could have resulted in a similar outcome. In other words, similar to SD, *r* is not affected by sample size 54 , but the *P* -value most certainly will be. Conversely, a large sample size will detect significance even when the correlation coefficient is relatively weak. Nevertheless, for some types of studies, a small correlation coefficient with a low *P* -value might be considered scientifically important. Panels E and F point out the dangers of relying just on *P* -values without looking directly at the scatterplot. In Panel E, we have both a reasonably high value for *r* along with a low *P* -value. Looking at the plot, however, it is clear that a straight line is not a good fit for these data points, which curve up to the right and eventually level out. Thus, the reported *r* and *P* -values, though technically correct, would misrepresent the true nature of the relationship between these variables. In the case of Panel F, *r* is effectively zero, but it is clear that the two variables have a very strong relationship. Such examples would require additional analysis, such as modeling, which is described briefly below.

Another very useful thing about *r* is that it can be squared to give *R*^{2} (or *r*^{2}), also called the *coefficient of determination*. The reason *R*^{2} is useful is that it allows for a very easy interpretation of the relationship between the two variables. This is best shown by example. In the case of our GFP/mCherry experiment, we obtained *r* = 0.68, and squaring this gives us 0.462. Thus, we can say that 46.2% of the variability in mCherry can be explained by differences in the levels of GFP. The rest, 53.8%, is due to other factors. Of course, we can also say that 46.2% of the variability in GFP can be explained by differences in the levels of mCherry, as *R*^{2} itself does not imply a direction. If, however, GFP is a reporter for a transcription factor and mCherry is a reporter for a structural gene, a causal relationship, along with a specific regulatory direction, is certainly suggested. In this case, additional experiments would have to be carried out to clarify the underlying biology.
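Pearson's *r* (and hence the coefficient of determination) is straightforward to compute directly from the co-variation of the raw values; the intensity data below are invented purely for illustration:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

gfp = [1, 2, 3, 4, 5]      # hypothetical GFP intensities
mcherry = [2, 1, 4, 3, 5]  # hypothetical mCherry intensities
r = pearson_r(gfp, mcherry)
print(round(r, 2), round(r**2, 2))  # 0.8 0.64
```

Note that swapping `gfp` and `mcherry` in the call leaves *r* unchanged, matching the point made above about the axes of the scatterplot.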

### 5.5. Modeling and regression

The basic idea behind *modeling* and *regression* methods is to come up with an equation that can make useful predictions or describe the behavior of a system. In *simple linear regression*, a single *predictor* or *independent variable*, such as the GFP intensity of a heat-shock reporter, might be used to predict the behavior of a *response* or *dependent variable*, such as the life span of a worm 55 . The end result would be an equation 56 that describes a line that is often, although not always, straight 57 . *Multiple regression* is an extension of simple linear regression, but it utilizes two or more variables in the prediction 58 . A classic example of multiple regression used in many statistics texts and classes concerns the weight of bears. Because it's not practical to weigh bears in the field, proxy measures such as head circumference, body length, and abdominal girth are acquired and fitted to an equation (by a human-aided computer or a computer-aided human), such that approximate weights can be inferred without the use of a scale. Like single and multiple linear regression, *nonlinear regression* also fits data (i.e., predictive variables) to a curve that can be described by an equation. In some cases, the curves generated by nonlinear regression may be quite complex. Unlike linear regression, nonlinear regression cannot be described using simple algebra. Nonlinear regression is an *iterative* method, and the mathematics behind its workings are relatively complex. It is used in a number of fields including pharmacology. *Logistic regression* uses one or more factors to predict the probability or odds of a *binary* or *dichotomous* outcome, such as life or death. It is often used to predict or model mortality given a set of factors; it is also used by employers in decisions related to hiring and by government agencies to predict the likelihood of criminal recidivism 59 .
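For the simple linear case, the least-squares estimates of the coefficients in the y = b_{1}x + b_{0} form (see footnote 56) can be sketched in a few lines; the data points here are invented and chosen to fall exactly on a line:

```python
from statistics import mean

def linear_fit(xs, ys):
    """Least-squares estimates of b1 and b0 for y = b1*x + b0."""
    mx, my = mean(xs), mean(ys)
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b1, b0

# invented predictor/response pairs that happen to fall on y = 2x + 1
b1, b0 = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(b1, b0)  # 2.0 1.0
```

With the coefficients in hand, plugging a new value of x into b1*x + b0 yields the predicted response, exactly as footnote 56 describes.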

46 Admittedly, standard western blots would also contain an additional probe to control for loading variability, but this has been omitted for simplification purposes and would not change the analysis following adjustments for differences in loading.

47 A similar, although perhaps slightly less stringent argument, can be made against averaging cycle numbers from independent qRT-PCR runs. Admittedly, if cDNA template loading is well controlled, qRT-PCR cycle numbers are not as prone to the same arbitrary and dramatic swings as bands on a western. However, subtle differences in the quality or amount of the template, chemical reagents, enzymes, and cycler runs can conspire to produce substantial differences between experiments.

48 This Excel tool was developed by KG.

49 The maximum possible P-values can be inferred from the CIs. For example, if a 99% CI does not encompass the number one, the ratio expected if no difference existed, then you can be sure the P-value from a two-tailed test is <0.01.

50 Admittedly, there is nothing particularly “natural” sounding about 2.718281828…

51 An example of this is described in Doitsidou et al., 2007

52 In the case of no correlation, the least-squares fit (which you will read about in a moment) will be a straight line with a slope of zero (i.e., a horizontal line). Generally speaking, even when there is no real correlation, however, the slope will always be a non-zero number because of chance sampling effects.

53 For example, nations that supplement their water with fluoride have higher cancer rates. The reason is not because fluoride is mutagenic. It is because fluoride supplementation is carried out by wealthier countries where health care is better and people live longer. Since cancer is largely a disease of old age, increased cancer rates in this case simply reflect a wealthier long-lived population. There is no meaningful cause and effect. On a separate note, it would not be terribly surprising to learn that people who write chapters on statistics have an increased tendency to become psychologically unhinged (a positive correlation). One possibility is that the very endeavor of writing about statistics results in authors becoming mentally imbalanced. Alternatively, volunteering to write a statistics chapter might be a symptom of some underlying psychosis. In these scenarios cause and effect could be occurring, but we don't know which is the cause and which is the effect.

54 In truth, SD is affected very slightly by sample size, hence SD is considered to be a “biased” estimator of variation. The effect, however, is small and generally ignored by most introductory texts. The same is true for the correlation coefficient, r.

55 Rea et al. (2005) Nat. Genet. 37, 894-898. In this case, the investigators did not conclude causation but nevertheless suggested that the reporter levels may reflect a physiological state that leads to greater longevity and robust health. Furthermore, based on the worm-sorting methods used, linear regression was not an applicable outcome of their analysis.

56 The standard form of simple linear regression equations takes the form y = b_{1}x + b_{0}, where y is the predicted value for the response variable, x is the predictor variable, b_{1} is the slope coefficient, and b_{0} is the y-axis intercept. Thus, because b_{1} and b_{0} are known constants, by plugging in a value for x, y can be predicted. For simple linear regression where the fit is a straight line, the slope coefficient will be the same as that derived using the least-squares method.

57 Although seemingly nonsensical, the output of a linear regression equation can be a curved line. The confusion is the result of a difference between the common non-technical and mathematical uses of the term “linear”. To generate a curve, one can introduce an exponent, such as a square, to the predictor variable (e.g., x^{2}). Thus, the equation could look like this: y = b_{1}x^{2} + b_{0}.

58 A multiple regression equation might look something like this: Y = b_{1}X_{1} + b_{2}X_{2} - b_{3}X_{3} + b_{0}, where X_{1-3} represent different predictor variables, b_{1-3} represent different slope coefficients determined by the regression analysis, and b_{0} is the Y-axis intercept. Plugging in the values for X_{1-3}, Y could thus be predicted.

59 Even without the use of logistic regression, I can predict with near 100% certainty that I will never agree to author another chapter on statistics! (DF)

### What Is It?

Data analysis is the process of interpreting the meaning of the data we have collected, organized, and displayed in the form of a table, bar chart, line graph, or other representation. The process involves looking for patterns—similarities, disparities, trends, and other relationships—and thinking about what these patterns might mean.

When analyzing data, ask students questions such as:

What does this graph tell you?

Who could use this data? How could they use it?

Why is this data shown in a line graph?

The process of collecting, organizing, and analyzing data is not always a simple, sequential process; sometimes a preliminary analysis of a data set may prompt us to look at the data in another way, or even to go back and collect additional data to test an emerging hypothesis. For example, students could survey their classmates on how they are transported to school (such as by car, by bus, by foot, or another way), and then display the data in a circle graph.

After analyzing the data in this graph, students might look at the data in a different way. Students might be interested in finding out more about people who are transported to school by car. Why do they ride in a car to school? Are they on a bus route? Do they carpool with other students? Are they close enough to school to walk, but choose to ride? Is the neighborhood between home and school too dangerous to walk through? Do the people who walk sometimes ride in a car, also? They might discover that most students in the "other" category ride their bikes to school, and decide to create an additional category.

In all grades, students look at graphical displays and describe them by identifying aspects such as the greatest value, the least value, and the relationship of one data point to another. Students in the intermediate grades learn how to summarize or characterize a data set in greater depth by determining the range and two measures of center, the mode and median. Students in the upper grades learn to find the third measure of center, the mean, and also to determine quartiles, identify outliers, and, for scatterplots, calculate a line or curve of best fit and describe any resulting correlation. High-school students should be able to design their own investigations that include effective sampling, representative data, and an unbiased interpretation of the results.

At every grade level, you should encourage students to think about the meaning of the data they have collected and displayed. The crucial question is "Why?"

### Why Is It Important?

The ability to make inferences and predictions based on data is a critical skill students need to develop.

Data analysis is crucial to the development of theories and new ideas. By paying close attention to patterns, the stories behind outliers, relationships between and among data sets, and the external factors that may have affected the data, students may come to have a deeper understanding of the crucial distinction between theory and evidence.

## Quartiles

Another related idea is Quartiles, which splits the data into quarters:

### Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8

The numbers are in order. Cut the list into quarters: the lower half is 1, 3, 3, 4, 5 and the upper half is 6, 6, 7, 8, 8, so Quartile 1 is the middle of the lower half (3) and Quartile 3 is the middle of the upper half (7).

In this case Quartile 2 is halfway between 5 and 6: Quartile 2 = (5 + 6) / 2 = **5.5**

The Quartiles also divide the data into divisions of 25%, so:

- Quartile 1 (Q1) can be called the 25th percentile
- Quartile 2 (Q2) can be called the 50th percentile
- Quartile 3 (Q3) can be called the 75th percentile

### Example: (continued)

For **1, 3, 3, 4, 5, 6, 6, 7, 8, 8**:

- The 25th percentile = **3**
- The 50th percentile = **5.5**
- The 75th percentile = **7**
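The median-of-halves procedure used in this example can be sketched in Python. (Note that other quartile conventions, such as interpolation-based percentiles, can give slightly different answers for the same data.)

```python
from statistics import median

def quartiles(data):
    """Quartiles by the median-of-halves method:
    Q2 is the overall median; Q1 and Q3 are the medians of the
    lower and upper halves (the middle value is excluded from
    both halves when the sample size is odd)."""
    s = sorted(data)
    half = len(s) // 2
    q1 = median(s[:half])   # middle of the lower half
    q2 = median(s)          # overall median
    q3 = median(s[-half:])  # middle of the upper half
    return q1, q2, q3

print(quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8]))  # (3, 5.5, 7)
```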

## Search algorithms

### Depth-first search

Depth-first search (DFS) is an algorithm that visits all edges in a graph *G* that belong to the same connected component as a vertex *v*.

Before running the algorithm, all |V| vertices must be marked as not visited.
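The pseudocode that the complexity argument below refers to is not shown here; a Python sketch of the recursive procedure, assuming the graph is represented as a dict mapping each vertex to its list of neighbors:

```python
def dfs(graph, v, visited):
    """Visit every vertex reachable from v (recursive depth-first search)."""
    if visited[v]:           # the constant-time if statement
        return
    visited[v] = True        # the constant-time mark operation
    for u in graph[v]:       # one recursive call per incident edge
        dfs(graph, u, visited)

graph = {'a': ['b', 'c'], 'b': ['a'], 'c': ['a'], 'd': []}
visited = {v: False for v in graph}  # all |V| vertices start unvisited
dfs(graph, 'a', visited)
print(visited)  # {'a': True, 'b': True, 'c': True, 'd': False}
```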

#### Time complexity

To compute the time complexity, we can use the number of calls to DFS as an elementary operation: the if statement and the mark operation both run in constant time, and the for loop makes a single call to DFS for each iteration.

Let E′ be the set of all edges in the connected component visited by the algorithm. The algorithm makes two calls to DFS for each edge (*u*, *v*) in E′: one time when the algorithm visits the neighbors of *u*, and one time when it visits the neighbors of *v*.

Hence, the time complexity of the algorithm is Θ(|V| + |E′|).

### Breadth-first search

Breadth-first search (BFS) also visits all vertices that belong to the same component as *v*. However, the vertices are visited in distance order: the algorithm first visits *v*, then all neighbors of *v*, then their neighbors, and so on.

Before running the algorithm, all |V| vertices must be marked as not visited.
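A queue-based Python sketch (again assuming a dict of neighbor lists; the visit order is returned to make the distance ordering visible):

```python
from collections import deque

def bfs(graph, v, visited):
    """Visit all vertices in the component of v, in distance order."""
    order = []
    visited[v] = True
    queue = deque([v])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in graph[u]:        # one for-loop iteration per incident edge
            if not visited[w]:
                visited[w] = True
                queue.append(w)
    return order

graph = {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': ['a'], 'd': ['b']}
visited = {u: False for u in graph}
order = bfs(graph, 'a', visited)
print(order)  # ['a', 'b', 'c', 'd']
```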

#### Time complexity

The time complexity of BFS can be computed as the total number of iterations performed by the for loop.

Let E′ be the set of all edges in the connected component visited by the algorithm. For each edge (*u*, *v*) in E′ the algorithm makes two for-loop iteration steps: one time when the algorithm visits the neighbors of *u*, and one time when it visits the neighbors of *v*.

Hence, the time complexity is Θ(|V| + |E′|).

### Dijkstra’s algorithm

Dijkstra’s algorithm computes the shortest path from a vertex *s*, the source, to all other vertices. The graph must have non-negative edge costs.

The algorithm returns two arrays:

- dist[k] holds the length of a shortest path from s to k ,
- prev[k] holds the previous vertex in a shortest path from s to k .

#### Time complexity

To compute the time complexity we can use the same type of argument as for BFS.

The main difference is that we need to account for the cost of adding, updating and finding the minimum distances in the queue. If we implement the queue with a heap, all of these operations can be performed in *O*(log |V|) time.
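A heap-based sketch of the algorithm (the dist and prev arrays are represented as dicts here; stale heap entries are skipped rather than updated in place, a common workaround since the standard binary heap has no decrease-key operation):

```python
import heapq

def dijkstra(graph, s):
    """graph maps vertex -> list of (neighbor, cost); costs must be non-negative."""
    dist = {v: float('inf') for v in graph}
    prev = {v: None for v in graph}
    dist[s] = 0
    heap = [(0, s)]                    # priority queue of (distance, vertex)
    while heap:
        d, u = heapq.heappop(heap)     # O(log |V|) extract-min
        if d > dist[u]:
            continue                   # stale entry; a shorter path was already found
        for v, cost in graph[u]:
            nd = d + cost
            if nd < dist[v]:           # relax the edge (u, v)
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))  # O(log |V|) insert
    return dist, prev

graph = {'s': [('a', 2), ('b', 5)], 'a': [('b', 1)], 'b': []}
dist, prev = dijkstra(graph, 's')
print(dist)  # {'s': 0, 'a': 2, 'b': 3}
```

Here the direct edge s→b costs 5, but the path s→a→b costs 3, so `dist['b']` is 3 and `prev['b']` is `'a'`.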
