Can the value of heritability be greater than 1?


Heritability defined as genetic variance divided by total variance seems to be bounded between 0 and 1. However, I see a way of calculating heritability on this page (http://www.radford.edu/~rsheehy/Gen_flash/Tutorials/Linear_Regression/reg-tut.htm) with: 2*Covxy / Varx. However, a result calculated in this way could be greater than 1. What am I missing?
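The formula on the linked page matches the usual regression of offspring values on the values of one parent, where the slope Cov(x, y) / Var(x) estimates half of the heritability. The ratio VG/VP is bounded by definition, but an estimate computed from a sample is not. The Python sketch below illustrates this with made-up numbers; the function name and the data are hypothetical.

```python
def heritability_from_single_parent(parent, offspring):
    """Estimate heritability as 2 * Cov(x, y) / Var(x) from paired samples."""
    n = len(parent)
    if n != len(offspring) or n < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    mean_x = sum(parent) / n
    mean_y = sum(offspring) / n
    # Sample covariance and variance (n - 1 in the denominator).
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(parent, offspring)) / (n - 1)
    var_x = sum((x - mean_x) ** 2 for x in parent) / (n - 1)
    return 2 * cov_xy / var_x


# Made-up trait values: the regression slope here is 0.7, so the estimate is 1.4.
parent = [10.0, 12.0, 14.0, 16.0]
offspring = [11.0, 12.0, 14.0, 15.0]
estimate = heritability_from_single_parent(parent, offspring)
print(estimate)
```

Any sample in which the offspring-on-parent slope exceeds 0.5 gives an estimate above 1. That can happen through sampling error, or when parents and offspring resemble each other for non-genetic reasons such as a shared environment.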


Can the value of heritability be greater than 1? - Biology

A polygenic trait usually shows a great deal of phenotypic variability in a population. Some of this variation is genetic, VG, arising from the different genotypic classes in the population and from dominance and epistatic effects. Some of it is environmental, VE, arising from environmental influences on how the genes are expressed. (There may also be gene-environment interaction variance; for now, we will assume this to be 0.) Thus, there are two major components of the total phenotypic variability observed in a sample, and they are additive:

VP (total phenotypic variance) = VG (genetic variance) + VE (environmental variance)

Scientists and others have long been interested in methods to measure these quantities. VP is the total statistical variance for the trait in the population and can be calculated from population data. However, it is not easy to determine VG and VE. Even when estimated values are obtained, their accuracy is in doubt, and they pertain only to the population at that one time, in that one environment. The values would likely change if the environment changed or if a different sample (from a different population) were to be measured for that trait. Since one can't partition an individual into genetic versus environmental components of a trait, measures of variability in samples of populations are used, even though these measures need to be treated with a great deal of caution.

The heritability of a trait, H, is defined as the fraction of the total variance for that trait that is genetic: H = VG/VP.

Notice that H is the fraction of the variability that is genetic; it is not the fraction of the trait that is genetically determined. (That cannot be measured.) Notice that if there were no genetic variation (all individuals having the same genotype), then VG = 0 and H = 0. Conversely, if there were no environmental variability (each individual subject to identical environmental influences), then VE = 0 and H = 1. These are the theoretical limits for the H value. Of course, there always is some VG and some VE for any polygenic trait, so H would lie somewhere between 0 and 1.
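The definition and its two limiting cases can be written out as a short Python sketch. It assumes the additive model VP = VG + VE with no interaction term; the function name and the numbers are purely illustrative.

```python
def broad_sense_heritability(vg, ve):
    """Return H = VG / VP, assuming VP = VG + VE (no interaction variance)."""
    if vg < 0 or ve < 0:
        raise ValueError("variances cannot be negative")
    vp = vg + ve
    if vp == 0:
        raise ValueError("no phenotypic variance: H is undefined")
    return vg / vp


print(broad_sense_heritability(0.0, 5.0))  # no genetic variation: H = 0
print(broad_sense_heritability(5.0, 0.0))  # no environmental variation: H = 1
print(broad_sense_heritability(3.0, 1.0))  # intermediate case: H = 0.75
```

Because both variances are non-negative, the ratio itself cannot leave the interval from 0 to 1.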

It is extremely important to interpret the H value correctly. An H value of 0.5 or greater is considered to indicate high heritability: most of the variance for the trait is genetic. This is not the same as saying that the trait is mostly genetically determined. The trait may, in fact, not be influenced much by the genes at all, but still happen to have a diversity of genotypes in the population and not much environmental variation. Consequently, VG comes out higher than VE, giving a high H value. Conversely, an H value of less than 0.2 is considered a low heritability value, meaning that most of the variance for the trait is environmental. Again, it is the variability that is being judged, not necessarily the trait itself. It could be that the environment does not influence the trait all that much, but if there is almost no genetic variation in the population for this trait, then what little VE there is still accounts for most of the variation, giving a low H value.

Because the environmental influences are so diverse from one population to another, H values apply only to the population on which they are measured at the time of the measurement. They should never be extended to a comparison of populations, unless the environments of the populations are identical. With humans, for example, this is impossible to claim. When someone says that H = 0.8 for human IQ and, therefore, IQ is mostly genetically determined, and, therefore, any differences between populations (races) in IQ are genetically based, this person is making at least three errors by making this claim:

1) H = 0.8 is the value obtained for samples of Caucasian populations; it applies only to the specific population(s) sampled in the test. African Americans have not been as extensively tested as Caucasians. If one were to compare them, one would have to assume identical environmental influences on IQ for African Americans as for Caucasians. Obviously, African Americans and U.S. Caucasians have different cultural and socio-economic realities in the U.S. The issue, then, is whether cultural and socio-economic differences between African Americans and Caucasians are inconsequential insofar as affecting IQ scores -- probably not.

2) H = 0.8 is indeed a high value, but it means only that there is extensive genetic variability (as compared to environmental variability). This should never be interpreted as necessarily meaning that there is a large genetic component to IQ determination itself.

3) There are indeed differences in average IQ between African Americans and Caucasians. To say that these differences are genetic is, however, scientifically invalid. One cannot use the H value of one population sample and compare it to another population sample unless that other population is under identical environmental influences.

Scientifically, there are insufficient data and knowledge of what determines IQ to draw any definitive conclusions. There probably is a significant VGE term -- interaction between genes and environment -- so that in reality VP = VG + VE + VGE. The VGE may be bigger than either VG or VE alone -- who knows?

The genetic view on IQ determination would be something as follows: Each individual's genes set a range of possible IQ values, but where within that range (upper and lower limits) the IQ is realized depends on environmental factors. Thus, the genetic view is that (1) people differ in their genetic basis for IQ -- everyone isn't created equal -- because no two people are genetically identical, (2) environment sets IQ within each person's range potential, and (3) different genotypes (individuals) probably respond to different environments --there is no single best environment for all genotypes (the VGE).

One can get a rough idea of the extent of genetic involvement in a trait by comparing concordance values for that trait between monozygotic (MZ) and dizygotic (DZ) twins. Consider, for example, diabetes mellitus. MZ twin concordance is 47% versus 9.7% for DZ twins. The comparison of MZ twins with DZ twins shows a genetic basis, but it also shows an environmental role, since 47% is less than 100%. In contrast, MZ and DZ twins have the same concordance value for death from acute infection. If MZ and DZ concordance are equal, then there is little if any genetic basis for the trait.

To get an H value we need V (variance) values to plug into H = VG/VP. Consider the trait "height". Measure the V in height for MZ twins. Since these are genetically identical, VMZ = VE = all environmental variance. If we use DZ twins of like sex, then we can (maybe) assume that they have the same VE as MZ twins, so we have a value for VE.

The variability value for DZ twins underestimates the overall population VG, because DZ twins have half their genes in common and, therefore, aren't as variable genetically as other members of the population. So, double the variability of DZ twins to extend it to the population as a whole. Also notice that VDZ = 1/2 VG + VE, so VDZ - VE = 1/2 VG, or VG = 2(VDZ - VE).

Now our assumption that we can use VMZ as VE comes into play, and we get VG = 2(VDZ - VMZ). Thus, the value of H is calculated from measured VDZ and VMZ values as H = 2(VDZ - VMZ)/VP.

H values so calculated are subject to many errors, especially because of the VMZ = VE assumption: VE for DZ may, in fact, be higher than VE for MZ.
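The twin calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic under the stated assumption VMZ = VE; the function name and the variance values are made up.

```python
def twin_heritability(v_dz, v_mz, v_p):
    """Return H = 2 * (VDZ - VMZ) / VP, assuming VMZ equals VE."""
    if v_p <= 0:
        raise ValueError("total phenotypic variance must be positive")
    v_g = 2 * (v_dz - v_mz)  # VG = 2(VDZ - VE), with VE taken to be VMZ
    return v_g / v_p


# Hypothetical variances chosen so that VG = 40, VE = 10 and VP = 50.
print(twin_heritability(v_dz=30.0, v_mz=10.0, v_p=50.0))  # 0.8

# If VE for DZ twins is really higher than VMZ, VDZ is inflated and the
# estimate can leave the 0-to-1 range.
print(twin_heritability(v_dz=40.0, v_mz=10.0, v_p=50.0))  # 1.2
```

The second call shows why the VMZ = VE assumption matters: the formula itself does nothing to keep the estimate between 0 and 1.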

Going back to IQ, researchers have, in this manner, calculated H to be somewhere around 0.6 - 0.8. Other studies using correlation coefficients also suggest large H values. MZ twins reared together show a correlation coefficient of 0.85 (1.00 would be expected if IQ were totally genetically determined). MZ twins reared apart give a value of 0.65 (less than 0.85), showing an environmental component. Sibs reared apart give a value of 0.25, showing a genetic component. The data, therefore, show that both genetic and environmental components are involved in the determination of IQ.


Heritability 101: What is “heritability”?

Whenever a newborn arrives, there’s often a conversation about who in the family the baby resembles. “How adorable, he has his father's nose!” “Look, she has Grandma Sue’s red hair!” “He looks just like Uncle Robert when he was a baby!” These comments often continue as the child grows up. “She got her mother’s smarts.” “He got his grandpa’s good looks.” “All her musical talent comes from her dad’s side of the family.”

Children tend to resemble their parents, their siblings and to a lesser degree their extended family. You can make a good guess about how tall a child will grow up to be by averaging the height of his or her parents. These similarities among family members extend beyond height to a great many traits [1].

Why do these traits run in families? One possibility is that they share an environment. Maybe parents intentionally raise their kids to be like them. Maybe there’s “something in the water”, as the old expression goes. Additionally some of it may be genetic, the end result of the DNA passed to each child from the biological parents at conception. This would help explain why for some traits an adopted child will more strongly resemble his or her biological parents than the adoptive parents, or why twins will more strongly resemble one another than siblings.

When a trait can be passed on through genetics we call it “heritable”, since it is inherited from your biological parents. Some heritable traits, like your blood type or sickle-cell disease, are entirely determined by genetics in this way. Most traits, however, are only partially heritable. You probably have heard plenty of anecdotes about friends or family members who are unusually different from their families. The uncle who is 6 inches taller than everyone else in the family. The extrovert in a family full of introverts. These traits are heritable, but they may be affected by a bunch of environmental factors as well.

“Heritability”, then, is a way to describe how much a trait is related to genetics. We’ll leave the question of how we estimate heritability (and many other technical details) to a later post. Instead, let’s focus first on what heritability is, what it is not (i.e. common misconceptions), and why it’s something we’re interested in studying.

What heritability is:

First a semi-formal definition: heritability is the proportion of variation in a trait explained by inherited genetic variants [2]. In other words, it’s a way to measure how much the differences in people’s DNA can explain the differences in their traits. Heritability can be between 0 (genetics explains nothing about the trait) and 1 (genetics explains everything). For example, the heritability of height is about 0.80, and the heritability of hours of sleep per night is 0.15-0.20 [3].

Heritability estimates how well we could predict a trait from genetics (if we completely understood all the relevant genetic effects). Similarly, it also tells us how well we could predict the trait in you based on that trait in your parents. Actually making this prediction from your DNA would require precisely knowing the effect of every genetic variant, which is very, very far away from being a reality. But the heritability puts an upper limit on how good that prediction could ever be as we learn more about the genetics of the trait.

Heritability measures how important genetics is to a trait. A high heritability, close to 1, indicates that genetics explains a lot of the variation in a trait between different people; a low heritability, near zero, indicates that most of the variation is not genetic. Just because a trait has a high heritability doesn’t necessarily mean that there’s some specific gene that directly causes it in some obvious biological way, but it does mean that the total contribution of direct and indirect causal effects and other correlations between specific DNA variants and the trait is enough to be informative [4].

Heritability is a property of the population not the individual. When the heritability of a trait is described, it reflects how much variability in the population is a consequence of genetic factors. It does not “explain” why an individual has a disease.

Heritability is specific to how a trait was measured. Traits that are harder to measure, and thus have more random measurement error, will be less heritable (since random measurement error isn’t genetic). This can also cause differences in heritability that depend on who measures the trait (e.g. reporting something about yourself vs. diagnoses from a doctor vs. physical measurements), or between a simplified measure and the intended trait (e.g. heritability of taking Prozac vs. heritability of depression).

Heritability is specific to whom the trait was measured in. Since the heritability involves the total variation of the trait in the population, it matters in what population you’re comparing the genetic effects. The heritability of the trait in individuals from a particular country, ethnicity, range of ages/birth years, and/or socioeconomic status (among other features) may or may not be the same as that trait’s heritability in a different population that has a different genetic background and is exposed to a different environment.

What heritability is not:

Heritability is not fate. Just because a trait is heritable and exists in your parents doesn’t mean you are destined to have that trait. It may be more likely, but it’s not inevitable.

Heritability is not immutable. Since heritability reflects the balance between the effects of genetic and environmental factors, if you change the environment you can change the trait’s heritability.

Heritability does not measure our ability to affect the trait. Hair color is highly heritable, but you can dye your hair whatever color you wish (including colors you can’t inherit). BMI (body mass index) is heritable, but that doesn’t mean diet and exercise can’t have an impact. In other words, heritability is not some final statement on the power of “nature vs. nurture”.

High heritability does not mean group differences are genetic. There is a troubling history of attributing observed group differences, such as reported racial disparities in IQ scores, to genetics. As noted above, heritability is specific to the choice of measurement, population, and environment, and the heritability of a trait is not immutable. As a result, it’s not valid to use a trait’s estimated heritability as evidence for “inherent” differences between populations.

Why does heritability matter?

Estimating the heritability of a trait (in a given population) is a starting point for understanding that trait, rather than an end goal. This is even more true for the version of heritability we’ve estimated in UK Biobank, which only accounts for a portion of the total potential genetic variation influencing a trait.

For geneticists and biologists, heritability is some indication of what traits will be fruitful to study. By providing a metric for how much a trait is related to genetics as opposed to other factors, it tells us how much to consider genetics if we want to learn more about the causes for that trait.

In health care, the heritability of physical measures (BMI, blood pressure, etc) and disorders provides insight on how much family history may predict patient outcomes, and how useful genetic testing may become for predicting disease risk and treatment outcomes.

Estimating heritability is also of interest in social science, where existing research suggests many aspects of life - from personality to education to sleep schedule to how many kids you have - are at least a little bit heritable. Identifying heritability in these traits can suggest areas where our DNA, often in complex and indirect ways, is associated with social outcomes.

And for all of us, knowing the heritability of our traits provides a little more understanding of the role of our DNA in shaping who we are. Genetics is almost never as simple as “nature vs. nurture”, but studying the heritability of human traits at least provides a glimpse of how those forces interact and points the way towards what we may be able to learn from our genes in the future.


Modern Morphometrics of Medically Important Insects

16.3.3 Heritability

Heritability depends on the genetic variability related to the trait under study, and therefore on the population under study. Its measurement is not indispensable to the interpretation of natural metric variation, but it can provide valuable information about the adaptiveness of metric traits. In insects, morphological traits commonly have higher heritability values than other trait categories such as life history, probably because the former are less closely tied to fitness.

Geometric techniques allow separate estimations of size and shape heritabilities. Size in insects may show consistent heritability values (Daly, 1992; Lehmann et al., 2006), so that they can be experimentally selected to constitute subpopulations genetically distinct for size (Anderson, 1973; Partridge et al., 1994). Various studies examining cross-environment heritability of wing shape in Diptera produced high and stable heritability, reaching 60% or more (Roff and Mousseau, 1987; Bitner-Mathé and Klaczko, 1999; Gilchrist and Partridge, 2001; Hoffman and Shirriffs, 2002). The consistent values of shape heritability suggest that a large fraction of morphometric divergence seen between natural populations of insects (Camara et al., 2006; Henry et al., 2010; Morales et al., 2010) may be due to additive effects of genes.

In Ae. aegypti, shape appears to be more heritable than size. When comparing size and shape cross-environment heritability on the same populations in Ae. aegypti, much higher values for shape (Figure 16.2) than for size were found, providing indirect evidence for different genetic sources of variation (Morales et al., unpublished data).

Figure 16.2 . Ae. aegypti: regression of the first relative warps (RW1) of laboratory daughters on the RW1 of corresponding field-collected mothers in a cross-environment study of the heritability of the wing shape at 18 landmarks (Morales et al., unpublished data). Lab F1, female specimens obtained after crossing field-collected specimens.


What Causes Sexual Orientation: Nature and Nurture

Sexual orientation is “an enduring pattern of emotional, romantic, and/or sexual attractions to men, women, or both sexes.” 1 In explaining what causes sexual orientation, it must be emphasized that no one can choose to be gay, bi-sexual, or heterosexual. Sexual orientation is determined by two main factors: nature and nurture. Nature refers to genes and biology, while nurture includes environmental and “social influences.” 2

One nature factor in sexual orientation is genes. This has been confirmed in scientific studies that measure heritability: the “amount of phenotypic (observable) variation in a population that is attributable to individual genetic differences.” 3 For example, if a group of individuals all receive good nutrition (share a similar environment), the differences in height will be due to genetic differences. 4 Height is a highly heritable trait.

Heritability is measured as a statistic. It is “expressed as a proportion (such as .60)”, and “the maximum value it can have is 1.00.” 5 If heritability is 1.00, “then all variation in a population is due to differences or variation between genotypes.” 6 If heritability is 0.00, then “all variation in the population comes from differences in the environments experienced by individuals.” 7 Heritability cannot be measured for an individual person, but “only to a particular group living in a particular environment” and “only to variations within a group.” 8

Heritability is estimated through twin studies. A 2010 Swedish study, using data from the Swedish Twin Registry, “undertook the largest ever population based twin study to estimate the influence of genetic and environmental effects on same-sex sexual behavior.” 9 For the male twins who had “any lifetime same-sex partner”, heritability estimates were 39%. 10 For female twins, only “18-19% of same sex sexual behaviors were explained by genetic factors.” 11 The study found that genes played a much greater role in same-sex sexual orientation for men than for women.

The second factor in sexual orientation is nurture. The Swedish study found that for male twins, unique environmental factors accounted for 61% of same-sex sexual behavior. 12 For female twins, unique environmental factors accounted for 64-66% of same-sex sexual behavior, and 16-17% was due to shared environmental effects. 13 For both male and female twins, nurture played a much greater role than nature in determining a person’s sexual orientation.

It is also the consensus of psychiatrists and psychologists that nurture is an important factor in sexual orientation. According to the Royal College of Psychiatrists, “sexual orientation is determined by a combination of biological and postnatal environmental factors.” 14 Similarly, the American Psychological Association states, “There is no consensus among scientists about the exact reasons that an individual develops a heterosexual, bisexual, gay, or lesbian orientation… No findings have emerged that permit scientists to conclude that sexual orientation is determined by any particular factor or factors. Many think that nature and nurture both play complex roles.” 15


Genetic vs. heritable trait

When someone tells you that height is 80% heritable, does that mean:

a) 80% of the reason you are the height you are is due to genes

b) 80% of the variation within the population on the trait of height is due to variation of the genes

The answer is of course b. Unfortunately in the 5 years I’ve been blogging the conception of heritability has been rather difficult to get across, and I regularly have to browbeat readers who conflate the term with a. That is, they assume that if I say that a trait is mostly heritable I mean that its development is mostly a function of genes. In reality not only is that false, it’s incoherent. Heritability is addressing the population level correlation between phenotypic variation and genotypic variation. In other words, how well can genetic variation work as a proxy for phenotypic variation? What proportion of the phenotypic variation can be accounted for by genotypic variation? The key terms here are population level and variation (or technically, variance). We are not usually talking about individuals and we are restricting our discussion to traits which vary within the population.


Results

In this section, we use empirical data and simulations of the toy-model to show that most of the heritability estimators borrowed from classical quantitative genetics are prone to significant bias, because they neglect or inaccurately model the change in resemblance between transmission partners caused by within-host evolution of the pathogen. Based on the toy-model simulations, we designate the intraclass correlation in the closest phylogenetic pairs (CPPs) and the phylogenetic heritability, $H^2_{OU}(\bar{t})$, measured by the phylogenetic Ornstein–Uhlenbeck mixed model (POUMM) (Mitov and Stadler 2016; Blanquart et al. 2017) as the most reliable estimators of pathogen trait heritability. Based on applying these estimators to a large HIV cohort, we establish a lower bound for the lg(spVL)-heritability.

Through the rest of the article, we use the symbol dij to denote the phylogenetic distance between two tips, i and j, on a transmission tree (fig. 1). dij summarizes the total evolutionary distance between two infected hosts at the moment of measuring the trait value and is measured in substitutions per site for real trees and arbitrary time units for simulated trees. We begin our report with a result from HIV data demonstrating the relevance of within-host evolution for estimating heritability.

The lg(spVL) Correlation in HIV Phylogenetic Pairs Decreases with dij

We used one-way analysis of variance (ANOVA, rA) and Spearman correlation (rSp) to estimate the correlation in phylogenetic pairs (PP) extracted from a recently published transmission tree of 8,483 HIV patients (Hodcroft et al. 2014). As defined in Shirreff et al. (2013), phylogenetic pairs represent pairs of tips in the transmission tree that are mutually nearest to each other by phylogenetic distance (dij) (fig. 1). We ordered the PPs by dij and split them into ten strata of equal size (deciles), evaluating the correlation between pair trait values (rA and rSp) in each stratum. The point estimates and the 95% confidence intervals (CI) are shown with black and magenta points and error bars on figure 3. Dashed horizontal bars denote the 95% CI for rA evaluated on all phylogenetic pairs. Despite some irregularities, there is a well pronounced pattern of decay in the correlation: strata to the left (small dij) tend to have higher rA values than strata to the right (big dij). The values of rA closely matched the values from other correlation estimators, such as DR (b) and the Pearson product-moment correlation (r) (results not shown). We performed ordinary least squares regressions (OLS) of the values $r_{A,D_k}$ and $r_{Sp,D_k}$ on the mean phylogenetic distance, $\bar{d}_{ij,k}$, in each stratum, $k = 1, \ldots, 10$. The slopes of both regressions were significantly negative (P<0.05) and are shown as black and magenta lines on figure 3. Similar slopes were obtained when using other stratifications of the data (supplementary fig. S1, Supplementary Material online).
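The stratify-then-regress procedure described above can be sketched in plain Python. This is an illustration, not the authors' code: it substitutes the Pearson correlation for the ANOVA-based and Spearman estimators, and the function names and the toy pairs (two strata instead of ten) are made up.

```python
def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ols_slope(xs, ys):
    """Slope of the ordinary least squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def stratified_correlations(pairs, n_strata=10):
    """pairs: list of (d_ij, z_i, z_j). Returns [(mean d_ij, r)] per stratum."""
    if len(pairs) < 2 * n_strata:
        raise ValueError("need at least two pairs per stratum")
    ordered = sorted(pairs, key=lambda p: p[0])  # order the pairs by distance
    size = len(ordered) // n_strata
    result = []
    for k in range(n_strata):
        start = k * size
        # The last stratum absorbs any remainder.
        end = (k + 1) * size if k < n_strata - 1 else len(ordered)
        stratum = ordered[start:end]
        mean_d = sum(p[0] for p in stratum) / len(stratum)
        r = pearson([p[1] for p in stratum], [p[2] for p in stratum])
        result.append((mean_d, r))
    return result


# Toy data: close pairs agree perfectly, distant pairs agree less.
pairs = [(0.01, 1, 1), (0.02, 2, 2), (0.03, 3, 3),
         (0.10, 1, 2), (0.11, 2, 1), (0.12, 3, 3)]
strata = stratified_correlations(pairs, n_strata=2)
slope = ols_slope([d for d, _ in strata], [r for _, r in strata])
print(strata, slope)
```

A negative slope in the final regression corresponds to the decay of correlation with phylogenetic distance reported in the text.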

Fig. 3. Correlation between lg(spVL)-values in HIV phylogenetic pairs. A sample of 1917 PPs with lg(spVL)-measurements from HIV patients shows a decrease in the correlation (ICC) between pair trait values as a function of the pair phylogenetic distance dij. The point estimates and 95% CIs in ten strata of equal size (deciles) are depicted as points and error bars positioned at the mean dij for each stratum, $\bar{d}_{ij}$. Black and magenta points with error-bars denote the estimated rA and rSp in the real data. Dashed horizontal bars denote the 95% CI for rA evaluated on all phylogenetic pairs. A black and a magenta inclined line denote the least squares linear regression of rA and rSp on $\bar{d}_{ij}$. Brown and green points with error bars denote the estimated values of rA obtained after replacing the real trait values on the tree by values simulated under the maximum likelihood fit of the PMM and the POUMM methods, respectively (mean and 95% CI estimated from 100 replications). A brown and a green line show the expected correlation between pairs of tips at distance dij, as modeled under the ML-fit of the PMM and the POUMM (eqs. 2 and 3). A light-brown and a light-green region depict the 95% high posterior density (HPD) intervals inferred from Bayesian fit of the two models (Materials and Methods).

The above result shows that the value of a heritability estimator based on the correlation within phylogenetic pairs (including DR couples) depends strongly on dij. Another issue of all estimators of $H^2$ using the correlation in phylogenetic or DR pairs is that the underlying statistical methods require independence between the pairs: the trait values in one pair should not influence or be correlated with the trait values in any other pair. This assumption is not valid in general, due to the phylogenetic relationship between all patients. One way to mitigate the effects of phylogenetic relationship between pairs is to limit the analysis to the closest pairs (i.e., pairs for which dij does not exceed some user-specified threshold). This approach has the drawback of omitting much of the data from the analysis. As an alternative taking advantage of the entire tree, it is possible to correct for the phylogenetic relationship by using a phylogenetic comparative method (PCM). PCMs attempt to solve both of the above problems, because they 1) incorporate the branch lengths in the transmission tree to model the variance–covariance structure of the data and 2) correct for the phylogenetic correlation when estimating evolutionary parameters or the phylogenetic heritability of the trait (Felsenstein 1985; Housworth et al. 2004; Alizon et al. 2010). These advantages of the PCMs come at the price of assuming a specific stochastic process as a model of the trait evolution along the tree. In the next subsection, we show that assuming an inappropriate process for the trait evolution can cause a significant bias in the estimate of phylogenetic heritability.

A Brownian Motion Process Cannot Reproduce the Decay of Correlation in the UK Data

We implemented a maximum likelihood and a Bayesian fit of the PMM (Lynch 1991; Housworth et al. 2004) and its extension to an Ornstein–Uhlenbeck model of evolution (POUMM) (Hansen 1997; Mitov and Stadler 2016; Blanquart et al. 2017). The PMM and the POUMM assume an additive model of the trait values, $z(t) = g(t) + e$, in which z(t) represents the trait value at time t for a given lineage of the tree, g(t) represents a heritable (genotypic) value at time t for this lineage and e represents a nonheritable contribution summarizing the effects of the host and his/her environment on the trait and the measurement error. The only difference between the two models is their assumption about the evolution of g(t) along the branches of the tree: the PMM assumes a Brownian motion process, whereas the POUMM assumes an Ornstein–Uhlenbeck process (Uhlenbeck and Ornstein 1930; Lande 1976; Hansen 1997).
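To make the difference between the two assumptions concrete, the Python sketch below compares how the variance of g(t) grows along a single lineage under each process, assuming the lineage starts from a fixed root value. The closed-form variances are the standard results for Brownian motion and for an Ornstein–Uhlenbeck process with rate alpha. The function names and the parameter values are illustrative and are not estimates from the data.

```python
import math


def bm_variance(sigma, t):
    """Variance of g(t) under Brownian motion with unit-time SD sigma."""
    return sigma ** 2 * t


def ou_variance(sigma, alpha, t):
    """Variance of g(t) under an Ornstein-Uhlenbeck process with rate alpha > 0."""
    return sigma ** 2 / (2 * alpha) * (1 - math.exp(-2 * alpha * t))


sigma, alpha = 1.0, 20.0  # hypothetical values, chosen only for illustration
for t in (0.01, 0.05, 0.10, 1.00):
    print(t, bm_variance(sigma, t), ou_variance(sigma, alpha, t))
```

Under Brownian motion the variance keeps growing linearly with time, whereas under the Ornstein–Uhlenbeck process it levels off at sigma^2 / (2 * alpha). The two processes therefore imply different relationships between evolutionary distance and the resemblance of tips.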

Using the maximum likelihood estimates of the model parameters (supplementary table S1, Supplementary Material online), we simulated random trait trajectories on the UK tree, running 100 replications for each model. For each replication, we estimated the correlation, rA, in PPs using the simulated values instead of the real values. The resulting correlation estimates are shown on figure 3 as brown and green points and error bars for the PMM and POUMM simulations, respectively. We notice that there is a significant difference between the correlation estimates of the two models. In particular, in the leftmost decile the POUMM estimate is significantly higher than the PMM estimate (the POUMM 95% CI excludes the PMM estimate).

The last approximation in equation (5) follows from the fact that the term $\exp(-8.35 + 36.47\, d_{ij})$ is nearly 0 for the range of phylogenetic distances ($d_{ij} \in [0, 0.14]$) in the UK tree (see supplementary information, Supplementary Material online, for further details on the above approximations).

Equations (4) and (5) represent a linear and an exponential model of the correlation as a function of $d_{ij}$. The values of these equations at $d_{ij} = 0$ are equal to the phylogenetic heritabilities estimated at the mean root-tip distance $\bar{t}$ under PMM and POUMM (details on that later). The slope of the linear model (eq. 4) equals −0.36 (95% HPD [−0.58, −0.21]). The rate of the exponential decay (eq. 5) equals the POUMM parameter $\alpha = 28.78$ (95% HPD [16.64, 46.93]), and the half-life of decay equals $\ln(2)/\alpha = 0.02$ substitutions per site (95% HPD [0.01, 0.04]).
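As a quick check of the half-life quoted above, the following minimal Python sketch converts a decay rate into a half-life. Equation (5) is not reproduced in this text, so the function `correlation_decay` assumes the simplest exponential form, $r_0 \exp(-\alpha d_{ij})$; the function names and the value `r0 = 0.2` are illustrative only.

```python
import math


def half_life(alpha: float) -> float:
    """Distance at which exp(-alpha * d) has dropped to one half."""
    return math.log(2.0) / alpha


def correlation_decay(r0: float, alpha: float, d_ij: float) -> float:
    """Assumed exponential model: correlation r0 at d_ij = 0, decaying at rate alpha."""
    return r0 * math.exp(-alpha * d_ij)


alpha_hat = 28.78                  # point estimate of alpha quoted in the text
alpha_hpd = (16.64, 46.93)         # 95% HPD bounds quoted in the text

print(round(half_life(alpha_hat), 3))  # 0.024 substitutions per site
# A larger alpha means faster decay, so the HPD bounds swap order.
print(half_life(alpha_hpd[1]), half_life(alpha_hpd[0]))
# After one half-life the assumed correlation has halved.
print(correlation_decay(0.2, alpha_hat, half_life(alpha_hat)))
```

Rounded to two decimals, these values are consistent with the half-life and interval reported above.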

Plotting the values of equations (4) and (5) and their 95% HPD intervals on figure 3 reveals visually that the POUMM fits the data better than the PMM. Statistically, this is confirmed by a lower corrected Akaike Information Criterion (AICc) for the POUMM fit and a strictly positive HPD interval for the OU parameter $\alpha$ (supplementary table S1 and fig. S8, Supplementary Material online). The slope of the linear model derived from the PMM fit (eq. 4, brown line on fig. 3) is nearly flat compared with the slopes of the two OLS fits (black and magenta lines on fig. 3). To explain this, we notice that in the PMM, the covariance in phylogenetic pairs and the variance at the population level are modeled as linear functions of the root-mrca distance ($t_{ij}$) and the root-tip distance ($t$), respectively (numerator and denominator in eq. 2). Importantly, both of these linear functions are bound to the same slope parameter, $\sigma^2$. As it turns out, in the UK data, the covariance and the variance increase at different rates with respect to $t_{ij}$ and $t$ (see supplementary fig. S2 and supplementary information, Supplementary Material online). We conclude that the PMM is not an appropriate model for the correlation in phylogenetic pairs, being unable to model the above difference in the rates.

In the limit $d_{ij} \to 0$, a phylogenetic pair should be equivalent to a DR couple at the moment of transmission, that is, before the genotypes in the two hosts have diverged due to within-host evolution. Thus, it appears reasonable to use an estimate of the correlation at $d_{ij} = 0$ as a proxy for the broad-sense heritability, $H^2$, in the entire population. This idea has been applied in previous studies of HIV (Hecht et al. 2010; Hollingsworth et al. 2010; Bachmann et al. 2017; Blanquart et al. 2017) as well as malaria (Anderson et al. 2010). One potential obstacle to this approach is the possibility of introducing a sampling bias by filtering the data. For example, if the study is on a trait that evolves toward higher values during the course of infection, patients with lower trait values would tend to be more frequent among the CPPs than in the entire population. Thus, there is no guarantee that the trait distribution and, therefore, the heritability measured in the CPPs equals the heritability in the entire population. This problem of sampling bias affects both resemblance-based methods and the currently used phylogenetic comparative methods. This suggests that the approach of imposing a threshold on $d_{ij}$ or estimating the correlation ($r_A$, $r_{Sp}$, or another correlation measure) at $d_{ij} = 0$ needs further validation. In the next subsection, we use simulations of the toy model to show that sampling bias, although present, is comparatively small with respect to the negative bias due to measurement delay.

ANOVA-CPP and POUMM Are the Least Biased Heritability Estimators in Toy-Model Simulations

Grouping of the trait values by identical pathogen genotype. We evaluated the coefficient of determination adjusted for finite sample size, $R^2_{\mathrm{adj}}$, and the intraclass correlation (ICC) estimated using one-way ANOVA, $r_{A[\mathrm{id}]}$. The main difference between these two estimators is the ANOVA assumption that the group means (genotypic values) are sampled from a distribution of potentially many more genotypes than the ones found in the data. In contrast, $R^2_{\mathrm{adj}}$ assumes that all genotypes in the population are present in the sample. Since the latter assumption is true for the simulated epidemics, $R^2_{\mathrm{adj}}$ represents the reference (true) value of $H^2$ to which all other estimates are compared.

Known DR couples. We evaluated the regression slope of recipient on donor values in three ways: 1) $b$, based on the trait values at the moment of diagnosing the infection; 2) $b_0$, based on the trait values right after the transmission events; and 3) $b_{d'_{ij}}$, based on the subsample of diagnosed couples having $d_{ij}$ not exceeding a threshold $d'_{ij}$. Based on a trade-off between precision and bias, we specified $d'_{ij} = D_1$, with $D_1$ denoting the first decile in the empirical distribution of $d_{ij}$ (see supplementary information, Supplementary Material online).

Phylogenetic pairs (PPs) in $T_{10k}$. We evaluated the ICC using ANOVA in three ways: 1) $r_A$, based on all PPs; 2) $r_{A,D_1}$, based on CPPs defined as PPs in $T_{10k}$ having $d_{ij}$ not exceeding the first decile, $D_1$; and 3) $r_{A,0,\mathrm{lin}}$, the estimated intercept from a linear regression of the values $r_{A,D_k}$ on the mean values $\bar{d}_{ij,k}$ in each decile, $k = 1, \ldots, 10$. For the latter two estimators, which attempt to estimate $r_A$ at $d_{ij} = 0$, we use the acronym ANOVA-CPP. As an alternative to ANOVA that is more robust to outliers (e.g., extreme values at the tails of the trait distribution), we evaluated the Spearman correlation in the first decile, hereby denoted as $r_{Sp,D_1}$.

Transmission tree $T_{10k}$. We evaluated the phylogenetic heritability based on the ML fit of the PMM and POUMM models. Specifically, we compared the classical formula evaluated at the mean root-tip distance $\bar{t}$ in the tree (eqs. 10 and 12) (Housworth et al. 2004; Leventhal and Bonhoeffer 2016) and the empirical formula based on the sample trait variance, $s^2(z)$ (eqs. 11 and 13) (described in Materials and Methods). For the PMM, we denote these estimators by $H^2_{\mathrm{BM}}(\bar{t})$ and $H^2_{\mathrm{BM}e}$; for the POUMM, we use the symbols $H^2_{\mathrm{OU}}(\bar{t})$ and $H^2_{\mathrm{OU}e}$.

Table 1 summarizes the mathematical definition and the assumptions of the above estimators. A more detailed description of the PMM and the POUMM methods is provided in Materials and Methods. The referenced textbooks on quantitative genetics (Lynch and Walsh 1998) are excellent references for the other methods.

Table 1. Tested Estimators of the Broad-Sense Heritability of Pathogen Traits.

| Input data | Method (abbreviation) | Assumptions | Estimator |
|---|---|---|---|
| Grouping by identical infecting strain | Adjusted coefficient of determination | The sample of data contains all genotypes present in the population | $R^2_{\mathrm{adj}} = 1 - \frac{N-1}{N-K}\,\frac{s^2(z - \hat{G})}{s^2(z)}$ (6) |
| | One-way analysis of variance (ANOVA) | Independently sampled genotypes; i.i.n.d. trait-values within each group; equal within-group variances (homoscedasticity) | $r_{A[\mathrm{id}]} = \frac{(MS_b - MS_e)/n}{(MS_b - MS_e)/n + MS_e}$ (7) |
| Known donor–recipient couples | Donor–recipient regression (DR) | Independently sampled donor–recipient couples; equal residual variance across the range of donor-values (homoscedasticity); equal donor and population variances | $b = \frac{s(z_{\mathrm{don}}, z_{\mathrm{rcp}})}{s^2(z_{\mathrm{don}})}$ (8); variants: $b$, $b_0$, $b_{d'_{ij}}$ |
| Phylogenetic pairs (PPs) | ANOVA on all/closest PPs (ANOVA-PP, ANOVA-CPP) | ANOVA assumptions (see above) | Defined as in equation (7), but calculated on PPs; variants: $r_A$, $r_{A,d'_{ij}}$ |
| | Spearman correlation on all/closest PPs | PPs are independent from one another | Pearson (product-moment) correlation, calculated on the ranks of the trait-values; variants: $r_{Sp}$, $r_{Sp,d'_{ij}}$ |
| | Linear regression of $r_A$ on $d_{ij}$ upon a stratification | $r_A$ depends linearly on $d_{ij}$; equal residual variance across the range of $d_{ij}$ | The intercept, $r_{A,0,\mathrm{lin}}$, from the OLS fit of the model $r_A(d_{ij}) = r_{A,0,\mathrm{lin}} + \omega_1 d_{ij}$ (9) |
| Transmission tree | Phylogenetic mixed model (PMM) | Branching BM evolution; i.i.n.d. environmental deviation, $e \sim N(0, \sigma_e^2)$ | $H^2_{\mathrm{BM}}(\bar{t}) = \frac{\bar{t}\sigma^2}{\bar{t}\sigma^2 + \sigma_e^2}$ (10); $H^2_{\mathrm{BM}e} = 1 - \frac{\sigma_e^2}{s^2(z)}$ (11) |
| | Phylogenetic Ornstein–Uhlenbeck mixed model (POUMM) | Branching OU evolution; i.i.n.d. environmental deviation, $e \sim N(0, \sigma_e^2)$ | $H^2_{\mathrm{OU}}(\bar{t}) = \frac{\sigma^2 (1 - \exp(-2\alpha\bar{t}))}{\sigma^2 (1 - \exp(-2\alpha\bar{t})) + 2\alpha\sigma_e^2}$ (12); $H^2_{\mathrm{OU}e} = 1 - \frac{\sigma_e^2}{s^2(z)}$ (13) |

Note. Notation: $s^2(\cdot)$, sample variance; $s(\cdot,\cdot)$, sample covariance; $N$, number of patients; $K$, number of distinct groups of patients, that is, genotypes or phylogenetic pairs; $z$, measured values; $\hat{G}$, estimated genotypic values: mean values from patients carrying a given genotype; $z_{\mathrm{don}}$, donor values; $z_{\mathrm{rcp}}$, recipient values; $MS_e$, within-group mean square: $MS_e = \frac{\sum (z_i - \bar{z}_i)^2}{N - K}$, where $z_i$ is an individual’s value and $\bar{z}_i$ is the mean value of the group to which the individual belongs; $MS_b$, among-group mean square: $MS_b = \frac{\sum (\bar{z}_i - \bar{z})^2}{K - 1}$, where $\bar{z}_i$ is defined as above and $\bar{z}$ is the population mean value; $n$, weighted mean number of patients in a group, that is, $n = 2$ for phylogenetic pairs and $n = \left(N - \frac{\sum n_i^2}{N}\right)/(K - 1)$ for groups of variable size; $\alpha$, $\sigma$, $\sigma_e$, PMM/POUMM parameters (described in Materials and Methods); i.i.n.d., independent and identically normally distributed; $d_{ij}$, phylogenetic distance between donor–recipient pairs or phylogenetic pairs; $d'_{ij}$, threshold on $d_{ij}$ (see text).
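To make equations (6) and (7) concrete, here is a minimal Python sketch that computes both estimators from trait values grouped by a label (a genotype or a phylogenetic pair). It follows the notation in the note above; the function names `anova_icc` and `r2_adj` and the tiny input data are illustrative assumptions, not part of the original analysis.

```python
from collections import defaultdict
from statistics import mean, variance


def _split(z, groups):
    """Collect the trait values of each group in a dict: label -> list of values."""
    by_group = defaultdict(list)
    for value, label in zip(z, groups):
        by_group[label].append(float(value))
    return by_group


def anova_icc(z, groups):
    """Intraclass correlation from one-way ANOVA, equation (7)."""
    by_group = _split(z, groups)
    N, K = len(z), len(by_group)
    grand_mean = mean([float(v) for v in z])
    group_mean = {g: mean(vals) for g, vals in by_group.items()}
    ss_within = sum((v - group_mean[g]) ** 2
                    for g, vals in by_group.items() for v in vals)
    # Summing over individuals weights each group mean by its group size.
    ss_between = sum(len(vals) * (group_mean[g] - grand_mean) ** 2
                     for g, vals in by_group.items())
    ms_e = ss_within / (N - K)
    ms_b = ss_between / (K - 1)
    n = (N - sum(len(vals) ** 2 for vals in by_group.values()) / N) / (K - 1)
    var_between = (ms_b - ms_e) / n
    return var_between / (var_between + ms_e)


def r2_adj(z, groups):
    """Adjusted coefficient of determination, equation (6)."""
    by_group = _split(z, groups)
    N, K = len(z), len(by_group)
    group_mean = {g: mean(vals) for g, vals in by_group.items()}
    residuals = [float(v) - group_mean[g] for v, g in zip(z, groups)]
    total = [float(v) for v in z]
    return 1.0 - (N - 1) / (N - K) * variance(residuals) / variance(total)


# Hypothetical data: identical values within each pair, so both estimators equal 1.
print(anova_icc([1, 1, 2, 2, 3, 3], [0, 0, 1, 1, 2, 2]))
print(r2_adj([1, 1, 2, 2, 3, 3], [0, 0, 1, 1, 2, 2]))
# No between-group differences at all: the ANOVA estimate becomes negative.
print(anova_icc([1, 2, 1, 2], ["a", "a", "b", "b"]))
```

The last call illustrates that the ANOVA estimator is not confined to [0, 1] in finite samples, which is why a confidence interval for it can include negative values.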

Within: neutral/Between: neutral

Within: select/Between: neutral

Within: neutral/Between: select

Within: select/Between: select

For each of these scenarios and mean contact interval $1/\kappa \in \{2, 4, 6, 8, 10, 12\}$ (arbitrary time units), we executed ten simulations, resulting in a total of $4 \times 6 \times 10 = 240$ simulations. Of the 240 simulations, 175 resulted in epidemic outbreaks of at least 10,000 diagnosed hosts. For each outbreak, we analyzed the population of the first up to 10,000 diagnosed hosts.

Rarer transmission events (bigger $1/\kappa$) result in longer transmission trees and, therefore, longer average phylogenetic distance between tips, $d_{ij}$ (supplementary fig. S3, Supplementary Material online). This enabled demonstrating the effect of accumulating within-host evolution on the different heritability estimators (fig. 4).

Fig. 4. Heritability estimates in toy-model simulations. (A–D) $H^2$-estimates in simulations of “neutral” and “select” within-/between-host dynamics. Each group of box-whiskers summarizes the simulations for a fixed scenario and contact interval, $1/\kappa$; white boxes (background) denote true heritability, colored boxes (foreground) denote estimates. Statistical significance is evaluated through t-tests summarized in table 2.

Figure 4 shows that the estimators $b_{D_1}$, $b$, $r_{A,D_1}$, and $r_A$ are negatively biased in general for all toy-model scenarios. This bias tends to increase with the mean contact interval, $1/\kappa$ (and, correspondingly, with $d_{ij}$), because random within-host mutation tends to decrease the genetic overlap between DRs and phylogenetic pairs (supplementary fig. S4, Supplementary Material online). The negative bias was far less pronounced when imposing a threshold on $d_{ij}$, but this came at the cost of precision (less biased but longer box-whisker plots for $b_{D_1}$ and $r_{A,D_1}$ compared with $b$ and $r_A$) (fig. 4). Several additional sources of bias were revealed when considering the practically unavailable estimators $b_0$ and $r_{A[\mathrm{id}]}$. The estimator $r_{A[\mathrm{id}]}$ was positively biased due to the small number of simulated genotypes (only six); this was validated through additional simulations showing that $r_{A[\mathrm{id}]}$ converges to the true value for a slightly bigger number of genotypes (e.g., $K \ge 24$ genotypes, see supplementary information, Supplementary Material online). The estimator $b_0$ behaved accurately in the neutral/neutral scenario (excluding very short contact intervals) but tended to have a bias in both directions in all scenarios involving selection. The main reason for these biases was the phenomenon of “sampling bias”, that is, a difference between the distributions of measured values in the DR couples and the population of interest. Although its magnitude was comparatively small in the simulations, we presume that sampling bias could play an important role in real biological applications. We already gave an example of this bias in the previous subsection. Another manifestation of sampling bias is the fact that $b_0$ does not fully eliminate the effect of within-host evolution (and selection) in the donors. This is why, in cases of selection, the phenotypic variance in the donors tends to be smaller than the variance in the recipients as well as the variance in the population (supplementary fig. S5, Supplementary Material online). Additional details on these potential sources of bias are provided in supplementary information, Supplementary Material online.
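The attenuation of the donor–recipient slope by within-host mutation can be illustrated with a small Python sketch. This is not the toy model used in the study: it is a deliberately simplified, hypothetical setup in which each recipient keeps the donor's genotype unless, with probability `p_mutate`, the genotype is redrawn at random before measurement. The genotypic values, the noise level, and all function names are assumptions.

```python
import random


def dr_slope(z_don, z_rcp):
    """Regression slope of recipient on donor values, equation (8)."""
    n = len(z_don)
    m_don = sum(z_don) / n
    m_rcp = sum(z_rcp) / n
    cov = sum((d - m_don) * (r - m_rcp) for d, r in zip(z_don, z_rcp)) / (n - 1)
    var_don = sum((d - m_don) ** 2 for d in z_don) / (n - 1)
    return cov / var_don


def simulate_couples(n, p_mutate, rng,
                     genotypic_values=(3.0, 4.0, 5.0, 6.0), sigma_e=1.0):
    """Simulate n donor-recipient couples under the simplified setup described above."""
    z_don, z_rcp = [], []
    for _ in range(n):
        g_don = rng.choice(genotypic_values)
        # With probability p_mutate the recipient's genotype has changed by the
        # time of measurement; otherwise it is the transmitted donor genotype.
        g_rcp = rng.choice(genotypic_values) if rng.random() < p_mutate else g_don
        z_don.append(g_don + rng.gauss(0.0, sigma_e))
        z_rcp.append(g_rcp + rng.gauss(0.0, sigma_e))
    return z_don, z_rcp


rng = random.Random(42)
slopes = {p: dr_slope(*simulate_couples(50_000, p, rng)) for p in (0.0, 0.5)}
for p, b in slopes.items():
    print(f"p_mutate = {p}: b = {b:.2f}")
```

In this setup the true heritability is 1.25 / 2.25 ≈ 0.56 (genotypic variance over total variance). The slope should land close to that value when `p_mutate = 0` and close to half of it when `p_mutate = 0.5`, mirroring the negative bias described above.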

Further, the simulations showed that a worsening fit of the BM model on longer transmission trees caused an inflated estimate of the environmental deviation, $\sigma_e$, in the PMM fits and, therefore, a negative bias in $H^2_{\mathrm{BM}}(\bar{t})$ and $H^2_{\mathrm{BM}e}$ (compare estimates for small and big values of $1/\kappa$ on fig. 4 and supplementary fig. S6C, Supplementary Material online). In contrast with the PMM, the POUMM estimates, $H^2_{\mathrm{OU}}(\bar{t})$ and $H^2_{\mathrm{OU}e}$, were far more accurate, and the value of $\sigma_e$ in the POUMM ML fit nearly matched the true nonheritable deviation in most simulations (fig. 4 and supplementary fig. S6C, Supplementary Material online). The better ML fit of the POUMM was confirmed by stronger statistical support, namely by lower AICc values in all toy-model simulations (supplementary fig. S6D, Supplementary Material online).
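The link between an inflated $\sigma_e$ and a lower heritability estimate follows directly from equations (10) to (13). The Python sketch below simply evaluates those formulas; the function names and the parameter values in the example are illustrative assumptions, not fitted values.

```python
import math


def h2_bm(t_bar, sigma, sigma_e):
    """Equation (10): phylogenetic heritability under the PMM at mean root-tip distance t_bar."""
    genotypic_var = t_bar * sigma ** 2
    return genotypic_var / (genotypic_var + sigma_e ** 2)


def h2_ou(t_bar, alpha, sigma, sigma_e):
    """Equation (12): phylogenetic heritability under the POUMM at mean root-tip distance t_bar."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small alpha.
    numerator = sigma ** 2 * -math.expm1(-2.0 * alpha * t_bar)
    return numerator / (numerator + 2.0 * alpha * sigma_e ** 2)


def h2_empirical(sigma_e, sample_var):
    """Equations (11) and (13): one minus the nonheritable share of the sample variance."""
    return 1.0 - sigma_e ** 2 / sample_var


# Hypothetical parameter values, chosen only to show the direction of the effect.
print(h2_bm(t_bar=1.0, sigma=1.0, sigma_e=1.0))   # 0.5
print(h2_bm(t_bar=1.0, sigma=1.0, sigma_e=2.0))   # 0.2, a larger sigma_e lowers H^2
print(h2_empirical(sigma_e=1.0, sample_var=4.0))  # 0.75
# As alpha approaches 0, the OU formula approaches the BM formula.
print(h2_ou(t_bar=1.0, alpha=1e-8, sigma=1.0, sigma_e=1.0))
```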

The fact that the POUMM outperformed the PMM in all scenarios contradicted the initial belief that the PMM should be the better-suited model for a neutrally evolving trait represented by the neutral/neutral scenario, whereas the POUMM should fit better to scenarios involving selection. It was also counterintuitive that the inferred parameter $\alpha$ from the POUMM model was significantly positive in all simulations, including the neutral/neutral scenario (supplementary fig. S6B, Supplementary Material online). To better understand this phenomenon, we performed the PP stratification analysis on the toy-model data (supplementary fig. S7, Supplementary Material online). This revealed a pattern of correlation that decays exponentially with $d_{ij}$. The shape of exponential decay was most pronounced for longer contact intervals, $1/\kappa$, particularly in the neutral/neutral scenario (first column on supplementary fig. S7, Supplementary Material online). In supplementary information, Supplementary Material online, we show that an exponentially decaying phenotypic correlation is consistent with a neutrally mutating genotype under a Jukes–Cantor substitution model (Yang 2006). The decay of the correlation was still present in scenarios involving within- and/or between-host selection, but the observed pattern was rather irregular and deviated from an exponential function of $d_{ij}$ (supplementary fig. S7, Supplementary Material online). In most cases, the ML fit of the PMM method was a bad fit to the decay of correlation (brown dots and error bars on supplementary fig. S7, Supplementary Material online); for longer contact intervals, there was a tendency toward constant values of the correlation under PMM far below the true value (brown dots and error bars on supplementary fig. S7, Supplementary Material online). This explains the overall better accuracy of the POUMM versus the PMM method.

Table 2 shows the average bias of each tested estimator for each of the four scenarios. We conclude that, apart from the practically inaccessible estimators based on grouping by identical genotype ($R^2_{\mathrm{adj}}$ and $r_{A[\mathrm{id}]}$), the most accurate estimators of $H^2$ in the toy-model simulations are $H^2_{\mathrm{OU}}(\bar{t})$ and $H^2_{\mathrm{OU}e}$, followed by estimators of the correlation in PPs minimizing the phylogenetic distance $d_{ij}$, that is, $r_{A,D_1}$, $r_{A,0,\mathrm{lin}}$, and $r_{Sp,D_1}$. In the next subsection, we report the results from these estimators in the UK HIV data.
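The estimator $r_{A,0,\mathrm{lin}}$ (eq. 9) can be sketched in a few lines of Python: sort the pairs by $d_{ij}$, split them into deciles, compute the ANOVA intraclass correlation in each decile, and take the intercept of an ordinary least-squares line through the ten points. The code below is a simplified illustration on synthetic pairs whose correlation is constructed to fall linearly from 0.5 at $d_{ij} = 0$. The data-generating recipe and all names are assumptions, not the simulation used in the study.

```python
import random


def pair_icc(pairs):
    """ANOVA intraclass correlation for groups of size two (equation 7 with n = 2)."""
    K = len(pairs)
    N = 2 * K
    grand = sum(a + b for a, b in pairs) / N
    ms_b = sum(2 * ((a + b) / 2 - grand) ** 2 for a, b in pairs) / (K - 1)
    ms_e = sum((a - b) ** 2 / 2 for a, b in pairs) / (N - K)
    var_between = (ms_b - ms_e) / 2
    return var_between / (var_between + ms_e)


def ols_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


def icc_at_zero_distance(d_ij, pairs, n_bins=10):
    """Regress the per-bin ICC on the per-bin mean distance; the intercept estimates r_A at d_ij = 0."""
    order = sorted(range(len(pairs)), key=lambda i: d_ij[i])
    size = len(order) // n_bins
    mean_d, icc = [], []
    for k in range(n_bins):
        stop = (k + 1) * size if k < n_bins - 1 else len(order)
        idx = order[k * size:stop]
        mean_d.append(sum(d_ij[i] for i in idx) / len(idx))
        icc.append(pair_icc([pairs[i] for i in idx]))
    return ols_line(mean_d, icc)


# Synthetic pairs (hypothetical): genotypic correlation w = 1 - 5 * d_ij and
# unit environmental noise, so the phenotypic correlation is 0.5 - 2.5 * d_ij.
rng = random.Random(7)
distances, pairs = [], []
for _ in range(40_000):
    d = rng.uniform(0.0, 0.1)
    w = 1.0 - 5.0 * d
    g1 = rng.gauss(0.0, 1.0)
    g2 = w * g1 + (1.0 - w * w) ** 0.5 * rng.gauss(0.0, 1.0)
    distances.append(d)
    pairs.append((g1 + rng.gauss(0.0, 1.0), g2 + rng.gauss(0.0, 1.0)))

intercept, slope = icc_at_zero_distance(distances, pairs)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

With this construction, the intercept should come out near 0.5 and the slope should be negative. With only ten regression points the intercept is noisy, which is consistent with the trade-off between bias and precision noted above.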

Table 2. Mean Difference $\hat{H}^2 - R^2_{\mathrm{adj}}$ from the Toy-Model Simulations Grouped by Scenario.

| Estimator | Within: neutral / Between: neutral | Within: neutral / Between: select | Within: select / Between: neutral | Within: select / Between: select |
|---|---|---|---|---|
| $N$ | 50 | 41 | 47 | 37 |
| $b_0$ | −0.01* | −0.02** | 0.05** | 0.04** |
| $b_{D_1}$ | −0.07** | −0.04** | 0 | −0.01 |
| $b$ | −0.25** | −0.2** | −0.07** | −0.06** |
| $r_{A[\mathrm{id}]}$ | 0.05** | 0.05** | 0.08** | 0.06** |
| $\hat{r}_{A,0,\mathrm{lin}}$ | −0.05** | −0.06** | 0.01 | −0.04** |
| $r_{A,D_1}$ | −0.05** | −0.06** | 0 | −0.03* |
| $r_A$ | −0.18** | −0.15** | −0.06** | −0.08** |
| $r_{Sp,D_1}$ | −0.05** | −0.05** | −0.05** | −0.07** |
| $H^2_{\mathrm{BM}}(\bar{t})$ | −0.17** | −0.17** | −0.01 | −0.04* |
| $H^2_{\mathrm{BM}e}$ | −0.28** | −0.24** | −0.12** | −0.16** |
| $H^2_{\mathrm{OU}}(\bar{t})$ | −0.01 | −0.02** | 0.01* | 0.03** |
| $H^2_{\mathrm{OU}e}$ | −0.01 | −0.02** | 0.01* | 0.03** |

Note. Statistical significance is estimated by Student’s t-tests, with P values denoted by asterisks as follows: *P<0.01; **P<0.001. In the original table, a gray background indicates estimates that are unavailable in practice ($b_0$ and $r_{A[\mathrm{id}]}$, see text).

Heritability of lg(spVL) in the UK HIV Cohort

We evaluated the correlation in the CPPs (ANOVA and Spearman correlation) in data from the UK HIV cohort comprising lg(spVL) measurements and a tree of viral (pol) sequences from 8,483 patients, inferred previously (Hodcroft et al. 2014). In addition, we performed a Bayesian fit of the POUMM and the PMM methods to the same data. The goal was to test our conclusions on a real data set and to compare the $H^2$-estimates from CPPs and POUMM to previous PMM/ReML-estimates on exactly the same data (Hodcroft et al. 2014).

In applying ANOVA-CPP, the first step was to define the threshold phylogenetic distance for defining CPPs. To that end, we explored different stratifications of the PPs, as shown on supplementary figure S1B, Supplementary Material online, and a scatter plot of the phylogenetic distances against the absolute phenotypic differences, $|\Delta \lg(\mathrm{spVL})|$ (fig. 5A). This revealed a small set of 116 PPs having $d_{ij} \le 10^{-4}$ and nearly coinciding with the first vigintile (also called 20-quantile or ventile) of $d_{ij}$. The phylogenetic distance in all remaining tip-pairs was more than an order of magnitude bigger, that is, $d_{ij} > 10^{-3}$. Given that the phylogenetic distance on the transmission tree is measured in substitutions per site and the length of the pol-region is in the order of $10^3$ sites, we presume that the above set of 116 PPs corresponds to a set of 116 pairs of identical pol consensus sequences (no sequence data were available to check this). Based on this observation, we defined the above pairs as CPPs, and the threshold was formally set to $d'_{ij} = 10^{-4}$. We validated that the CPPs were randomly distributed along the tree (fig. 5B). The random distribution of the CPPs along the transmission tree suggests that these phylogenetic pairs correspond to randomly occurring early detections of infection (trait values from each pair depicted as magenta segments on fig. 5B). To check that the filtering of the data did not introduce a considerable sampling bias due to selection (see previous subsection), we also validated that there was no substantial difference in the trait distributions of all patients, the PPs, and the CPPs (fig. 5C).
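The reasoning behind the threshold can be written out as a short Python sketch: with distances in substitutions per site, a single substitution over roughly $10^3$ sites already gives a distance of about $10^{-3}$, so pairs at or below $10^{-4}$ are effectively identical sequences. The tuple layout, the example values, and the function names below are hypothetical.

```python
def distance_of_one_substitution(n_sites):
    """Phylogenetic distance (substitutions per site) produced by a single substitution."""
    return 1.0 / n_sites


def split_pairs(pairs, threshold=1e-4):
    """Split (d_ij, z_i, z_j) tuples into closest pairs (d_ij <= threshold) and the rest."""
    cpps = [p for p in pairs if p[0] <= threshold]
    others = [p for p in pairs if p[0] > threshold]
    return cpps, others


# Hypothetical pairs: (d_ij, lg(spVL) of tip i, lg(spVL) of tip j)
example_pairs = [
    (0.0, 4.1, 4.3),
    (5e-5, 5.0, 4.6),
    (2e-3, 3.9, 5.2),
    (0.03, 4.8, 3.7),
]

cpps, others = split_pairs(example_pairs)
print(len(cpps), len(others))                      # 2 2
print(distance_of_one_substitution(1000) > 1e-4)   # True
```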

Fig. 5. Phylogenetic pairs in lg(spVL) data from the United Kingdom. (A) A scatter plot of the phylogenetic distances between pairs of tips against their absolute phenotypic differences: gray, PPs ($d_{ij} > 10^{-4}$); magenta, CPPs ($d_{ij} < 10^{-4}$). A black line shows the linear regression of $|\Delta \lg(\mathrm{spVL})|$ on $d_{ij}$ (the slope of the regression was statistically positive at the 0.01 level). (B) A box-plot representing the trait distribution along the transmission tree. Each box-whisker represents the lg(spVL)-distribution of patients grouped by their distance from the root of the tree, measured in substitutions per site. Wider boxes indicate groups bigger in size. Segments in magenta denote lg(spVL)-values in CPPs. (C) A box-plot of the lg(spVL)-distribution in all patients (black), PPs (gray), and CPPs (magenta).

ANOVA-CPPs ($r_{A,D_1}$, $r_{A,10^{-4}}$, $r_{A,V_1}$) and the original PP-method $r_A$

The intercept from the linear regression of $r_A$ on $d_{ij}$ upon a stratification of the PPs into deciles ($r_{A,0,\mathrm{lin}}$, eq. 9)

Spearman correlation in CPPs ($r_{Sp,D_1}$, $r_{Sp,10^{-4}}$, $r_{Sp,V_1}$) and in all PPs ($r_{Sp}$)

The intercept from the linear regression of $r_{Sp}$ on $d_{ij}$ upon a stratification of the PPs into deciles ($r_{Sp,0,\mathrm{lin}}$)

POUMM ($H^2_{\mathrm{OU}}(\bar{t})$, $H^2_{\mathrm{OU}e}$) versus PMM ($H^2_{\mathrm{BM}}(\bar{t})$, $H^2_{\mathrm{BM}e}$) on the entire tree

The results from these analyses are reported in table 3. ANOVA and Spearman correlation estimates that minimized the phylogenetic distance by means of regression or filtering of the phylogenetic pairs had point estimates of $r_{A,10^{-4}} = 0.17$ and $r_{Sp,10^{-4}} = 0.22$. The slightly higher estimate for the Spearman correlation could be explained by the presence of outliers in the data. Applying the POUMM to the entire tree yielded a point estimate of $H^2_{\mathrm{OU}}(\bar{t}) = 0.21$ (8,483 patients, 95% CI [0.14, 0.29]).
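Why outliers can pull the two kinds of correlation apart is easy to see in a toy example. The Python sketch below computes the Spearman correlation as the Pearson correlation of the ranks (the definition given in table 1) and compares it with the plain Pearson correlation on data containing one extreme value. The function names and the numbers are illustrative and unrelated to the UK data.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values with one extreme outlier in y.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 3.0, 4.0, 100.0]
print(spearman(x, y))           # 1.0, the ordering is perfectly preserved
print(round(pearson(x, y), 2))  # 0.72, pulled down by the outlier
```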

Estimates of lg(spVL)-Heritability in HIV Data from the United Kingdom.

| Method | N | $\hat{H}^2$ | 95% CI or 95% HPD |
|---|---|---|---|
| Linear regression of $r_A$ on $\overline{d_{ij}}$ in deciles (eq. 9) ($r_{A,0,\mathrm{lin}}$) | 10 points | 0.17 | [0.09, 0.24] |
| Linear regression of $r_{Sp}$ on $\overline{d_{ij}}$ in deciles ($r_{Sp,0,\mathrm{lin}}$) | 10 points | 0.18 | [0.11, 0.25] |
| ANOVA-CPP ($r_{A,V_1}$) | 224 | 0.17 | [−0.02, 0.31] |
| ANOVA-CPP ($r_{A,10^{-4}}$) | 232 | 0.16 | [0.01, 0.30] |
| ANOVA-CPP ($r_{A,D_1}$) | 384 | 0.16 | [0.06, 0.25] |
| ANOVA-PP ($r_A$)^a | 3,834 | 0.11 | [0.08, 0.14] |
| Spearman-CPP ($r_{Sp,V_1}$) | 224 | 0.23 | [0.05, 0.42] |
| Spearman-CPP ($r_{Sp,10^{-4}}$) | 232 | 0.22 | [0.03, 0.4] |
| Spearman-CPP ($r_{Sp,D_1}$) | 384 | 0.2 | [0.06, 0.34] |
| Spearman-PP ($r_{Sp}$)^a | 3,834 | 0.11 | [0.06, 0.15] |
| POUMM ($H^2_{OU}(\bar{t})$) | 8,483 | 0.21 | [0.14, 0.29] |
| POUMM ($H^2_{OUe}$) | 8,483 | 0.2 | [0.13, 0.29] |
| PMM ($H^2_{BM}(\bar{t})$)^b | 8,483 | 0.08 | [0.05, 0.12] |
| PMM ($H^2_{BMe}$)^b | 8,483 | 0.06 | [0.02, 0.1] |
| PMM, ReML (Hodcroft et al. 2014)^b | 8,483 | 0.06 | [0.03, 0.09] |

Note. Also listed are the results from a previous analysis on the same data set (Hodcroft et al. 2014). “–”: the analysis was not done in the mentioned study. Gray background: estimates considered unreliable due to (a) negative bias caused by measurement delays or (b) negative bias caused by BM violation. Uncertainty in the estimates is expressed in terms of 95% confidence intervals (CI), or, in the case of Bayesian inference, by 95% highest posterior density intervals (HPDs).

Conversely, the heritability estimates from the original PP method (ANOVA or Spearman correlation on all PPs) and from the PMM were significantly lower, falling below the 95% CIs from the POUMM (table 3). This confirms the observation from the toy-model simulations that these estimators are negatively biased, since they ignore or inaccurately model the changing correlation within pairs of tips. The stronger statistical support for the POUMM over the PMM was confirmed by its lower AICc value (supplementary table S1, Supplementary Material online) and by the posterior density for the POUMM parameter α (supplementary fig. S8, Supplementary Material online).
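For readers unfamiliar with the AICc criterion mentioned above, here is a minimal sketch of the standard small-sample corrected Akaike information criterion. The log-likelihoods and parameter counts in the example are hypothetical placeholders; the actual values for the POUMM and PMM fits are in the supplementary material and are not reproduced here.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Corrected Akaike information criterion (lower is better).

    AICc = 2k - 2 ln L + 2k(k + 1) / (n - k - 1)
    """
    if n_obs - n_params - 1 <= 0:
        raise ValueError("AICc is undefined unless n_obs > n_params + 1")
    aic = 2.0 * n_params - 2.0 * log_likelihood
    correction = (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return aic + correction


# Hypothetical log-likelihoods and parameter counts, for illustration only.
model_a = aicc(log_likelihood=-1250.0, n_params=5, n_obs=8483)
model_b = aicc(log_likelihood=-1275.0, n_params=3, n_obs=8483)
preferred = "model_a" if model_a < model_b else "model_b"
```

With thousands of observations the correction term is small, so the comparison is driven mainly by the likelihood gain weighed against the number of extra parameters.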

Finally, we compared our estimates of lg(spVL)-heritability to previous applications of the same methods on different data sets (fig. 6). In agreement with the toy-model simulations, estimates of $H^2$ using PMM or other BM-based phylogenetic methods (i.e., Blomberg’s K and Pagel’s λ) are notably lower than all other estimates, suggesting that these phylogenetic comparative methods underestimate $H^2$; resemblance-based estimates are down-biased by measurement delays (e.g., compare early vs. late in the Netherlands on fig. 6).

A comparison between $H^2$ estimates from the UK HIV cohort and previous estimates on African, Swiss, and Dutch data. (A) Estimates with minimized measurement delay (dark cadet-blue) and POUMM estimates (green). (B) Down-biased estimates due to higher measurement delays (light-blue) or violated BM-assumption (brown). Confidence is depicted either as segments indicating estimated 95% CI or as P values in cases of missing 95% CIs. References to the corresponding publications are written as numbers in superscript as follows: 1: Tang et al. (2004); 2: Hecht et al. (2010); 3: Hollingsworth et al. (2010); 4: van der Kuyl et al. (2010); 5: Lingappa et al. (2013); 6: Yue et al. (2013); 7: Alizon et al. (2010); 8: Shirreff et al. (2013); 9: Hodcroft et al. (2014); 10: Blanquart et al. (2017); 11: Bertels et al. (2018); 12: this work. For clarity, estimates from previous studies that are not directly comparable (e.g., previous results from Swiss MSM/strict data sets; Alizon et al. 2010) have been omitted.

In summary, POUMM and ANOVA-CPP yield consistent estimates of $H^2$ in the UK data, and these estimates agree with resemblance-based estimates in data sets with short measurement delay (different African countries and the Netherlands). Similar to the toy-model simulations, we notice a pronounced pattern of negative bias for the other estimators, PMM and ANOVA-PP, as well as for the previous resemblance-based studies on data with long measurement delay.


Brain size and intelligence

Some investigators have examined the relationship between brain size and intelligence. 26 For humans, the statistical relationship is modest but significant. Obviously, the finding is only correlational: greater brain size may cause greater intelligence, greater intelligence may cause greater brain size, or both may be dependent on some third factor. Moreover, how efficiently the brain is used is probably more important than its size. For example, on average, men have larger brains than women, but women have better connections, through the corpus callosum, between the two hemispheres. So it is not clear which sex would have, on average, an advantage—probably neither. 27

The relationship between brain size and intelligence does not hold across species. 28 Rather, there seems to be a relationship between intelligence and brain size relative to the rough general size of the organism (level of encephalization).


Results

Characterizing the neuronal DNA methylation landscape across 8 brain regions

Previous reports from us and others have shown distinct epigenetic landscapes among functionally diverse brain regions [11, 12]. Our previous study was limited to a small number of individuals and brain regions. However, we demonstrated that brain region-specific DNA methylation was primarily present in neuronal rather than non-neuronal nuclei. Further, the ratio of neurons to non-neurons even between adjacent sections from the same brain region differed greatly, severely confounding analysis of differential methylation in non-purified nuclei. Therefore, we applied a strategy of neuronal nuclei purification prior to whole genome bisulfite sequencing. Analyzing a much larger number of individuals and brain regions enabled us to address the potential existence of VMRs (regions of interindividual variation in methylation within a tissue), their relationship to each other, and the relationship to SNPs identified in these GTEx donors. Neuronal nuclei were isolated from brain tissues based on positive NeuN (RBFOX3) staining via fluorescence-activated nuclei sorting and NeuN+ nuclei are referred to as neuronal, while noting that this fraction is composed of multiple subpopulations (Additional file 1: Fig. S1a). We examined 8 brain regions collected from GTEx donors: amygdala (n = 12), anterior cingulate cortex (BA24) (n = 15), caudate (n = 22), frontal cortex (BA9) (n = 24), hippocampus (n = 20), hypothalamus (n = 13), nucleus accumbens (n = 23), and putamen (n = 16). In addition, we analyzed methylation of DNA isolated from two non-brain tissues from GTEx donors: lung (n = 18) and thyroid (n = 19) for a total of 182 samples (Additional file 2: Table S1). We generated > 30 billion uniquely mapped 150-bp paired-end reads with an average depth > 10X post-processing (Additional file 3: Table S2). Several samples were excluded due to genotype discordance, as the shipped sample genotype did not match the biobank records (Additional file 4: Table S3). 
Five additional samples were excluded after principal component analysis revealed sample mislabeling prior to receipt by our lab. We confirmed this by determining which tissues most closely matched the methylation of these samples (Additional file 1: Fig. S1b).

Principal component analysis of global neuronal DNA methylation levels revealed clear segregation of these brain regions in the first two principal components (Fig. 1a). We performed a differential analysis of CpG methylation identifying CG-DMRs, i.e., regions of differential CpG methylation among neuronal nuclei isolated from each brain region. Given that a single CG-DMR can represent a difference among multiple brain regions, rather than perform 28 pairwise comparisons, we used an F-test to identify 174,482 statistically significant autosomal neuronal CG-DMRs which are defined as regions of the genome where at least 2 of the 8 brain regions have different levels of CpG methylation (Additional file 5: Table S4). We control the family-wise error rate at 5% by permutation, and we use BSmooth to leverage information from nearby CpGs by smoothing. In a pilot study, we profiled NeuN+ cells from 4 brain regions using whole genome bisulfite sequencing on samples from 6 different individuals not part of GTEx [12]. We find that 99.5% of our previously identified neuronal CG-DMRs (13,019/13,074) overlap with CG-DMRs from our new analysis of GTEx samples (after correcting for multiple testing). To make a more precise comparison, we examined the correlation between the methylation differences between two tissues as measured separately in Rizzardi et al. [12] and this study (tissues were the nucleus accumbens and prefrontal cortex, selected because most of the CG-DMRs from Rizzardi et al. [12] were between those two tissues). We find a striking correlation of 0.97 as shown in Figure S1c highlighting the reproducibility of our experimental and analytical approaches across biobanks. 
This high level of reproducibility holds even when examining methylation differences among all 13,074 neuronal DMRs identified in [12] between two very similar cortical regions (Additional file 1: Figure S1d), which suggests that our approach is conservative, likely because we control the family-wise error rate.
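The idea of testing all regions with an F-test while controlling the family-wise error rate by permutation can be sketched as follows. This is a generic max-statistic permutation procedure written under simplifying assumptions (one methylation value per sample per region, no smoothing); it is not the BSmooth-based pipeline used in the study, and all function names and toy values are hypothetical.

```python
import random


def f_statistic(values, labels):
    """One-way ANOVA F statistic for a single region."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    ss_within = sum(
        (v - sum(vs) / len(vs)) ** 2 for vs in groups.values() for v in vs
    )
    if ss_within == 0.0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def max_f_threshold(regions, labels, n_perm=500, alpha=0.05, seed=0):
    """Empirical (1 - alpha) quantile of the maximum F over all regions
    when the tissue labels are shuffled across samples."""
    rng = random.Random(seed)
    shuffled = list(labels)
    max_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        max_stats.append(max(f_statistic(r, shuffled) for r in regions))
    max_stats.sort()
    index = min(n_perm - 1, int((1.0 - alpha) * n_perm))
    return max_stats[index]


# Hypothetical toy data: 3 tissues x 4 samples, 2 regions.
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
regions = [
    [0.10, 0.12, 0.11, 0.13, 0.50, 0.52, 0.51, 0.53, 0.90, 0.92, 0.91, 0.93],
    [0.4, 0.5, 0.5, 0.6] * 3,  # same distribution in every tissue
]
observed = [f_statistic(r, labels) for r in regions]
threshold = max_f_threshold(regions, labels)
significant = [f > threshold for f in observed]
```

Comparing every region against the permutation distribution of the maximum statistic is what makes the error control family-wise: under the null, the chance that any region exceeds the threshold is about `alpha`. A single F-test per region also avoids running the 28 pairwise comparisons among 8 brain regions.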

Identification of CG-DMRs among neurons of functionally distinct brain regions. DNA methylation was assessed in neuronal nuclei isolated from 8 brain regions as indicated from 12 to 24 individuals. a Principal component analysis of distances derived from average autosomal CpG methylation in 1-kb bins. b Hierarchical clustering of samples based on the average methylation per sample in the most discriminatory CG-DMRs (see the “Methods” section). c Heatmap representing log2 enrichment of CpGs within CG-DMRs and blocks identified in each CG-DMR analysis compared to the rest of the genome for genomic features. Gene models from GENCODEv26 (promoters, intronic, exonic, 5′UTR, 3′UTR, intergenic), CpG islands (CGIs) and related features from UCSC (shores, shelves, OpenSea), putative enhancer regions (enhancers and high confidence enhancers from PsychENCODE [18] and H3K27ac [19]). 5-group = CG-DMRs or blocks between all 5 tissue groups. d As in c, showing enrichment in regions of open chromatin in NeuN− and NeuN+ nuclei and NeuN+ nuclei isolated from the indicated brain regions (PV cortex, primary visual cortex; Med. thalamus, mediodorsal thalamus) from [10]. e Example CG-DMRs covering the NPTXR gene showing average methylation values for NeuN+ nuclei from each tissue group color coded as in b. Regions of differential methylation are shaded in pink. f Expression of NPTXR from sample matched bulk brain tissues from GTEx v8 data release

To facilitate interpretation of our data, we conducted a simpler analysis. Specifically, we collapsed frontal cortex and anterior cingulate cortex samples into a “cortical” group and caudate, putamen, and nucleus accumbens into a “basal ganglia” group. The resulting 5 tissue groups are consistent with the developmental origins of the brain regions: the telencephalon gives rise to the cerebral cortex (which branches into the frontal cortex, the anterior cingulate cortex, and the hippocampal formation) and cerebral nuclei (which branches into the amygdala and basal ganglia), while the diencephalon produces the hypothalamus [23]. We identified 181,146 autosomal neuronal CG-DMRs (196 Mb) among these 5 groups covering 11% of all CpGs. Further, the 5-group analysis captured 94% of the CG-DMRs identified in the 8-group analysis (Additional file 6: Table S5). Average DNA methylation levels of the most discriminatory CG-DMRs are, aside from several hippocampus samples, able to segregate samples into their tissue groups (Fig. 1b). We also identified 7671 large regions of differential CpG methylation (which we have previously termed “blocks” of differential CpG methylation; these are identified using a larger bandwidth for smoothing) among the 5 tissue groups (Additional file 7: Table S6). These CG-blocks covered 260 Mb and were on average 33.9 kb in size. CG-DMRs were enriched in enhancer regions identified by PsychENCODE [13], H3K27ac peaks found in the adult brain [24], and in regulatory chromHMM states from 4 brain regions [25] (Fig. 1c, Additional file 1: Fig. S1e). We also observed enrichment of our CG-DMRs in regions of open chromatin identified in NeuN+ nuclei from 14 brain regions [11] (Fig. 1d). Example CG-DMRs within the neuronal pentraxin 1 (NP1) gene (NPTXR) are shown (Fig. 1e) with hypomethylation in the hippocampal neurons associated with increased expression in bulk hippocampus tissue (Fig. 1f). 
NP1 is involved in glutamate receptor internalization and has been implicated in Alzheimer’s disease as its upregulation in response to increased amyloid-beta promotes neuronal toxicity [26].

Though we grouped them together in our initial CG-DMR analysis, there is a clear distinction among the basal ganglia tissues. These regions are of particular interest due to their importance in addiction and reward pathways [27], yet no comprehensive analysis of methylation differences in the human brain has been performed to date. We performed an additional DMR analysis to assess the methylation differences among neurons from these tissues and identified 16,866 autosomal neuronal basal ganglia CG-DMRs (24 Mb) encompassing 1.7% of all CpGs (Additional file 8: Table S7). Consistent with their regional identity, these basal ganglia CG-DMRs were specifically enriched in open chromatin regions identified in neuronal nuclei from striatal tissues [11] (Fig. 1d). Over 13% (2295/16,866) of these basal ganglia CG-DMRs were not identified in our 5-group CG-DMR analysis. We used the Genomic Regions Enrichment of Annotations Tool (rGREAT v4.0.0) [28] to identify enriched gene ontology terms associated with these unique basal ganglia CG-DMRs. Ten of the top 20 significantly enriched terms were related to ion transport or neuronal signaling (Additional file 9: Table S8). This result suggests that differential methylation near these genes (including Ca2+ and K+ voltage-gated channel subunit genes) could be involved in fine-tuning their expression in particular neuronal populations within the basal ganglia.

Differential methylation analysis identifies distinct neuronal subpopulations in the hippocampus

Interestingly, principal component analysis revealed two distinct clusters originating from the hippocampus that were not detected in our previous analysis of hippocampal tissues [12] (Figs. 1a and 2a). The hippocampus is composed of several subregions consisting of four “cornu ammonis” regions and the dentate gyrus. We hypothesized that our samples represented the specific pyramidal and granule neurons within these respective subregions. We tested this hypothesis by identifying autosomal hippocampal CG-DMRs (n = 11,702) between these two clusters (Fig. 2b, Additional file 10: Table S9). GREAT analysis of the top 2000 hippocampal CG-DMRs showed enrichment in neurogenesis and generation of neurons (Additional file 9: Table S8). As adult neurogenesis occurs in the dentate gyrus, these data suggest that some of these samples originated from that particular subregion. Gene expression data from [29,30,31] were used to compile a list of 75 genes specifically expressed in dentate gyrus granule neurons (Additional file 11: Table S10) and we intersected hippocampal CG-DMRs with these genes and their promoters (TSS ± 4 kb). We identified 117 hippocampal CG-DMRs overlapping these genes and found that in 12 of the 18 hippocampus samples these marker genes are hypomethylated compared to the other 6 hippocampus samples and the other brain tissues examined (Fig. 2c). Specific examination of the PROX1 gene, a marker of dentate gyrus granule neurons, reveals hypomethylation in the promoter and throughout the gene body in these 12 samples providing strong evidence that these samples were enriched for dentate gyrus neurons (Fig. 2d). This group of samples is referred to as dentate gyrus samples throughout the rest of the study bringing the total number of brain regions analyzed to nine.

Differential methylation reveals a subset of hippocampus samples originate from the dentate gyrus. a Principal component analysis of distances derived from average autosomal CpG methylation in 1-kb bins. Data shown are from this study and from [11] as indicated. b Hierarchical clustering of hippocampus samples based on the average methylation per sample in the CG-DMRs identified between the two hippocampus groups. c Hierarchical clustering of hippocampus samples based on the average methylation per sample of hippocampal CG-DMRs overlapping dentate gyrus marker genes. The primary marker of the dentate gyrus, PROX1, is boxed. d Example hippocampal CG-DMRs in the PROX1 gene with average methylation values calculated from NeuN+ nuclei isolated from indicated tissue groups and regions of differential methylation shaded pink

mCH DMRs

Non-CpG methylation (mCH) is widespread in human neurons, and mCH over gene bodies and regulatory elements is generally associated with repression [1, 12, 32]. Interestingly, reduced mCH specifically at neuronal enhancers has recently been associated with Alzheimer’s disease pathology [33]. Paradoxically, mCH has also been associated with lowly transcribed genes involved in neuronal development [34] as well as genes escaping X inactivation [35]. Given the importance of mCH in neuronal development and disease, we performed a differential analysis of mCH across our 5 tissue groups. We identified a total of 264,868 CH-DMRs across all contexts (CA, CT, CC) and strands (+, −) covering a third of the genome (1.0 Gb) (Additional file 12: Table S11). This result represents a > 10-fold increase in the number of CH-DMRs identified across the brain compared to our previous work [12]. In that study, we demonstrated high correlations among strands and contexts for mCH; therefore, we use mCA(+) to represent mCH in this study. Global analysis of mCA(+) by principal component analysis revealed segregation of samples based on tissue group though not to the same degree as CpG methylation (Fig. 3a). CH-DMRs were 3.5 times broader than CG-DMRs (3839 vs. 1086 bp, respectively) and were enriched in CG-DMRs with 67,979 (25%) CH-DMRs overlapping 118,621 CG-DMRs (65%). However, CH-DMRs showed little enrichment for genic or regulatory features and were depleted in CpG islands (Fig. 3b). CH-DMRs in the CA(+) context had a median methylation difference of 5.8% with 3195 having a methylation difference > 10%. These highly divergent CH-DMRs (Fig. 3b “> 10%”) were particularly enriched in genic/intronic and enhancer regions. Results from our CH-DMR analysis among basal ganglia tissues (152,056 CH-DMRs) and between the two hippocampal clusters (100,757 CH-DMRs) were similar to the 5-group CH-DMR analysis (Additional file 13: Table S12). 
CH-DMRs also showed a slight enrichment in open chromatin across all brain regions analyzed in [11] (Fig. 3c). Interestingly, hippocampal and basal ganglia CH-DMRs did not show similar enrichments, but were actually depleted in regions of open chromatin in some tissues. Consistent with CG-DMRs, the highly divergent CH-DMRs were generally hypermethylated in basal ganglia tissues compared to the others (Fig. 3d). CH-DMRs exhibit a high degree of overlap among analyses performed using the 5 tissue groups, basal ganglia samples, and hippocampus samples; this is also true for CG-DMRs (Fig. 3e, top). Additionally, CH-DMRs show substantial overlap with CG-DMRs as we previously reported [12] (Fig. 3e, bottom). We can detect many additional CH- and CG-DMRs when looking only among basal ganglia tissues or between hippocampus groups. An example CH-DMR is shown within the gene body of NRGN, which encodes the brain-specific protein neurogranin, recently identified as a cerebrospinal fluid biomarker for Alzheimer’s disease [36] (Fig. 3f).

Differential non-CpG methylation across functionally distinct brain regions. a Principal component analysis of distances derived from average autosomal, plus strand CA (mCA+) methylation in 1-kb bins. b Heatmap representing log2 enrichment of CA, CT, and CC within CH-DMRs compared to the rest of the genome for indicated features including CG-DMRs identified in this study. Gene models from GENCODEv26 (promoters, intronic, exonic, 5′UTR, 3′UTR, intergenic), CpG islands (CGIs) and related features from UCSC (shores, shelves, OpenSea), putative enhancer regions (enhancers and high confidence enhancers from PsychENCODE [18] and H3K27ac [19]). 5-group = CH-DMRs between all 5 groups; 5-group > 10% = 3195 CH-DMRs from 5-group comparison with mean CA-DMR methylation difference > 10%. c As in b, showing enrichment in regions of open chromatin in NeuN− and NeuN+ nuclei and NeuN+ nuclei isolated from the indicated brain regions (PV cortex, primary visual cortex; Med. thalamus, mediodorsal thalamus) from [10]. d Hierarchical clustering of samples based on the average CA(+) methylation per sample in the CA-DMRs with > 10% methylation difference among the 5 tissue groups. e Venn diagrams illustrating intersections between CH- and CG-DMRs identified between different analyses. f Example CA-DMR over NRGN with both strands and CG-DMRs (mCG(S) obtained from small smoothing window) and blocks (mCG(L) obtained from large smoothing window). Average methylation values calculated from NeuN+ nuclei isolated from indicated tissue groups. Regions of differential methylation are shaded in pink

Identification of VMRs in neurons isolated from human brain tissues

Interindividual variation in DNA methylation has been of interest to many groups and the GTEx sample collection allowed us to explore tissue-specific methylation variability at a genome-wide scale previously not possible. VMRs are loci that are highly variable among individuals within a given tissue type [21, 22]. As a matter of clarification, the word “variability” has been used in other work to refer to changes in DNA methylation between tissues [37], which is not the meaning of VMR used here. Prior studies of methylation variability in brain tissues have been limited to targeted genomic regions (Illumina 450k array) [38] or few individuals [39] using a single brain region. Systemic methylation variability can be driven by genetic effects (methylation QTLs in cis and/or trans), occur independently as metastable epialleles [40], be caused by environmental exposures, or be confounded by cell type heterogeneity. As we have previously shown [12], much of the variability due to cell type heterogeneity is removed upon isolating NeuN+ nuclei from brain tissues. However, proportions of neuronal subpopulations between brain regions and among individuals could still contribute to variability measurements. Using EpiDISH [41], we estimated the NeuN+ proportions in our samples. For reference data, we used sites of differential methylation between NeuN+ and NeuN− nuclei isolated from orbitofrontal cortex identified in Kozlenkov et al. [42]. We filtered the 51,412 CpG sites they identified as “neuronal undermethylated” and “glial undermethylated” for |Δβ| > 0.7 resulting in 426 sites. We then eliminated CpGs that overlapped DMRs identified in our 8-group and hippocampal analyses resulting in 201 reference CpGs. This step was critical to eliminate variation due to known region-specific neuronal methylation differences. 
We found that only four hippocampus samples had any evidence of glial (NeuN−) contamination thus providing independent validation of our sorting efficiency (Additional file 1: Fig. S2a). Only one of these was less than 97% neuronal with an estimate of 87%.

We identified VMRs by determining the 99th percentile of standard deviation of methylation values in each tissue and applying the lowest standard deviation value (SD = 0.095) as a single cutoff for all tissues (Additional file 1: Fig. S2b). Using the same SD cutoff allows different tissues to have different numbers of VMRs rather than taking the top most variable regions. This strategy allows for the possibility that some tissues are more variable than others. We identified a total of 81,130 unique VMRs containing > 10 CpGs and covering 159 Mb across all nine brain regions, lung, and thyroid (Fig. 4a, Table 1, Additional file 1: Fig. S2, Additional file 14: Table S13). The majority of VMRs are shared among two or more tissues (Fig. 4b, “Shared VMR”) with 333 shared among all brain regions. Of those, 202 are “ubiquitous” VMRs, regions of variability shared among all tissues including lung and thyroid (Fig. 4c). Remarkably, an average of 24% of the VMRs identified in each tissue are unique to that tissue and we provide examples of tissue-specific VMRs (Fig. 4b). To quantify the effect size of the variability, we used the range of the per-sample, across-region average methylation. The median effect size is 35% with almost all VMRs having an effect size greater than 20% and some reaching 50% or higher (Additional file 1: Fig. S2c).
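The cutoff rule described above (take the 99th percentile of the standard deviation in each tissue, then apply the lowest of those values to every tissue) can be sketched as follows. This is a simplified illustration on per-region standard deviations with a nearest-rank percentile; the study's actual procedure operates on smoothed per-CpG methylation and merges CpGs into regions, which is not reproduced here. Function names, tissue labels, and toy values are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of `values` for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def shared_sd_cutoff(sd_by_tissue, q=99.0):
    """Lowest per-tissue q-th percentile of SD, used as one cutoff for all tissues."""
    return min(percentile(sds, q) for sds in sd_by_tissue.values())


def call_variable_regions(sd_by_tissue, cutoff):
    """Indices of regions whose SD reaches the shared cutoff, per tissue."""
    return {
        tissue: [i for i, sd in enumerate(sds) if sd >= cutoff]
        for tissue, sds in sd_by_tissue.items()
    }


# Hypothetical toy SDs; a low percentile is used because the lists are tiny.
toy_sds = {
    "amygdala": [0.02, 0.05, 0.12, 0.20],
    "caudate": [0.01, 0.03, 0.04, 0.10],
}
cutoff = shared_sd_cutoff(toy_sds, q=75.0)
calls = call_variable_regions(toy_sds, cutoff)
```

Using one shared cutoff rather than a fixed top fraction per tissue lets a more variable tissue contribute more regions, which matches the rationale given in the text.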

Methylation variability across brain and non-brain tissues. a VMRs shared among two non-brain tissues and neuronal nuclei from distinct brain regions. The number of VMRs in each intersection is listed at the bottom of the plot. b Examples of shared and tissue-specific VMRs in the amygdala (pink), caudate (gray), and hypothalamus (green). VMRs are shaded pink in the tissue they were identified in and gray in tissues where they were not considered a VMR; SNPs are indicated when present. c Example of a ubiquitous VMR shared across brain and non-brain tissues as indicated. d Heatmap representing log2 enrichment of the union of all VMRs identified, ubiquitous VMRs, tissue-specific VMRs, and all VMRs identified in each tissue as indicated compared to the rest of the genome for genomic features. Gene models from GENCODEv26 (promoters, intronic, exonic, 5′UTR, 3′UTR, intergenic), CpG islands (CGIs) and related features from UCSC (shores, shelves, OpenSea), putative enhancer regions (enhancers and high confidence enhancers from PsychENCODE [18] and H3K27ac [19]). 5-group CG-DMRs = CG-DMRs identified among all 5 tissue groups

Almost all (97%) VMRs overlapped a CG-DMR with 35–50% of VMRs fully contained within a CG-DMR (Table 1). This percentage drops to 18–20% for VMRs identified in the lung and thyroid, which is expected as these two tissues were not included in CG-DMR analyses. This overlap is reflected in the enrichment of VMRs for CG-DMRs, which is less for the lung and thyroid for the reason stated above (Fig. 4d). This can be visualized in Fig. 4b (far right panels) where a VMR is present in the hypothalamus (bottom, green) and the mean methylation is significantly different than that in the amygdala (top, pink) or caudate (middle, gray) thus constituting a CG-DMR. These tissue-specific regions of methylation variability were enriched in putative regulatory regions including enhancer- and transcription-associated chromHMM states (Fig. 4d, Additional file 1: Fig. S3a). VMRs were found across all autosomes (Additional file 1: Fig. S3b) including the MHC region of chromosome 6 which is known to be highly variable. The MHC region, as well as the pericentromeric region of chromosome 20, harbored more ubiquitous VMRs than any other genomic region similar to previous results [39] (Additional file 1: Fig. S4). In contrast to most VMRs, ubiquitous VMRs were particularly enriched in CpG islands and shores (Fig. 4d).
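The two overlap summaries reported above (VMRs that overlap any CG-DMR versus VMRs fully contained within one) can be computed as in this minimal sketch. It assumes half-open `(chrom, start, end)` intervals and uses a simple quadratic scan; the coordinates and the function names are hypothetical, and a real analysis would typically use an interval-tree or genomic-ranges library instead.

```python
def overlaps(a, b):
    """True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def contained_in(inner, outer):
    """True if `inner` lies entirely within `outer`."""
    return inner[0] == outer[0] and outer[1] <= inner[1] and inner[2] <= outer[2]


def overlap_summary(vmrs, dmrs):
    """Fractions of VMRs overlapping, and fully contained within, any DMR."""
    n_overlap = sum(any(overlaps(v, d) for d in dmrs) for v in vmrs)
    n_contained = sum(any(contained_in(v, d) for d in dmrs) for v in vmrs)
    return n_overlap / len(vmrs), n_contained / len(vmrs)


# Hypothetical coordinates, for illustration only.
toy_vmrs = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
toy_dmrs = [("chr1", 50, 250), ("chr1", 350, 500)]
frac_overlap, frac_contained = overlap_summary(toy_vmrs, toy_dmrs)
```

Containment implies overlap, so the second fraction can never exceed the first, which is consistent with 97% of VMRs overlapping a CG-DMR while only 35–50% are fully contained.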

When considering tissue-specific VMRs, we first focused on the amygdala as 32% of VMRs identified in this tissue were unique to the amygdala. The amygdala displayed the highest interindividual variability among all tissues measured and this was evident both by principal component analysis (Fig. 1a) and in the increased number of total VMRs identified (Table 1). We hypothesized that, similar to the hippocampus, distinct subregions of the amygdala were isolated from these individuals resulting in increased neuronal heterogeneity. For tissue-specific VMRs, we cannot distinguish true variation from that due to cellular heterogeneity, which led us to investigate how large a contribution heterogeneity makes. Neuroimaging analyses of anatomical and functional connectivity have subdivided the amygdala into as many as 9 distinct subnuclei [43,44,45]. These subnuclei differ in strength of connectivity to other brain regions including the hypothalamus, hippocampus, and cortical regions. For example, a recent neuroimaging study found that the basolateral nucleus displayed stronger connections to the hypothalamus and visual cortex than the centrocortical nucleus which showed stronger connections to the primary motor cortex [46]. As these categorizations are based primarily on neuroimaging data, molecular features of these subregions have yet to be elucidated in the human amygdala. However, single cell RNA-seq of the medial amygdala in mouse led to the identification of 16 distinct neuronal subtypes [47]. We examined methylation within 1 kb of the human homologs of the 262 genes used to cluster the neurons and found 649 VMRs, 116 of which were specific to the amygdala. No tissue-specific VMRs from any other brain region were detected near these genes. Hierarchical clustering of amygdala neuronal samples based on these 649 VMRs reveals three groups suggesting that these samples may have originated from distinct subnuclei within the amygdala (Additional file 1: Fig. S5a,b). 
VMRs within the SLC17A7 gene, a marker of glutamatergic neurons, are shown as an example of variable methylation among these three sample groups (Additional file 1: Fig. S5c). These data strongly suggest that variability among amygdala samples is driven by neuronal subtype differences among the subregions sampled. We were unable to identify VMRs within these three distinct groups as only 3–4 individuals were in each subgroup.

When we consider those VMRs that are not tissue-specific, but are shared with at least one other brain region (as shown in Fig. 4a), it is unlikely those VMRs are due to neuronal heterogeneity, because they are shared between brain regions with distinct neuronal populations. Therefore, the variability of these regions must be shared among different cell types, which suggests they have some common biological function. Further analysis of the 949 VMRs shared solely between the amygdala and hypothalamus revealed an enrichment for neurotransmitter transport genes, particularly in the SLC family (Additional file 15: Table S14). There is a VMR 3 kb upstream of the TSS of SLC32A1 (Fig. 4b, “shared VMR”), which is expressed in GABAergic neurons and mediates uptake of GABA and glycine to synaptic vesicles [48]. These shared VMRs are also found near SLC6A1 (−15 kb) and SLC6A11 (+1.4 kb), two other GABA transporters for neurons and glia, respectively, as well as upstream of SLC6A3 (−52 kb), a dopamine transporter important in the pathogenesis of psychiatric disorders [49]. Among the VMRs identified in the other brain regions, 77–95% were shared among two or more. These shared VMRs could be important regions for integrating signaling inputs from neuronal crosstalk.

DMRs and VMRs are enriched for heritability of brain-linked traits

We and others have shown a strong association between differential epigenetic features and neurological, neuropsychiatric, and behavioral-cognitive phenotypes [4, 11, 12]. Using stratified linkage disequilibrium score regression [50], we asked if the VMRs we identified in each brain region were also associated with brain-linked traits (Additional file 16: Table S15). First, we replicated our previous findings that regions of differential CpG methylation are enriched for heritability of brain-linked traits including schizophrenia, neuroticism, and depressive symptoms (Fig. 5a, Additional file 1: Fig. S6, Additional file 17: Table S16). In addition, CG-DMRs and those identified among basal ganglia regions showed significant enrichment for heritability of attention deficit hyperactivity disorder (ADHD). VMRs identified in the hypothalamus were also enriched for the heritability of ADHD, while VMRs identified in the amygdala, anterior cingulate cortex, and hippocampus were significantly enriched for the heritability of schizophrenia (Fig. 5b). Amygdala VMRs showed a greater enrichment than CG-DMRs (6.5 vs 4.6) though they cover 75% less of the genome than CG-DMRs (Additional file 17: Table S16).
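For readers unfamiliar with the enrichment statistic: in stratified LD score regression, the enrichment of an annotation is commonly defined as the proportion of SNP heritability it explains divided by the proportion of SNPs it contains. The sketch below only illustrates that ratio. The function name and all input proportions are hypothetical and are not values from this study; they were chosen so that an annotation covering 75% fewer SNPs can still show the higher enrichment.

```python
def enrichment(prop_h2: float, prop_snps: float) -> float:
    """Enrichment = share of heritability explained / share of SNPs covered."""
    if not 0 < prop_snps <= 1:
        raise ValueError("prop_snps must be in (0, 1]")
    return prop_h2 / prop_snps


# Hypothetical annotations: the second covers 75% fewer SNPs than the first.
large = enrichment(prop_h2=0.184, prop_snps=0.04)
small = enrichment(prop_h2=0.065, prop_snps=0.01)
print(round(large, 1), round(small, 1))
```

A small annotation can therefore be more enriched than a large one even though it explains less heritability in absolute terms.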

Fig. 5 Neuronal CG-DMRs and VMRs are highly enriched for explained heritability of multiple psychiatric, neurological, and behavioral-cognitive traits. Results from running stratified linkage disequilibrium score regression using 30 GWAS traits with 97 baseline features and either DMRs (a) or VMRs (b) identified in this study. Only feature-trait combinations with a coefficient z-score significantly larger than 0 (one-sided z-test with alpha = 0.05, P-values corrected within each trait using Holm’s method) are shown. Enrichment score (y-axis) and coefficient z-score (x-axis) from running this analysis for each of the indicated methylation features combined with the baseline features are plotted.
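The significance filter described in this caption (a one-sided z-test followed by Holm's correction within a trait) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' pipeline; the feature names and z-scores are hypothetical.

```python
from statistics import NormalDist


def one_sided_p(z: float) -> float:
    """P-value for the alternative 'coefficient > 0', given a z-score."""
    return 1.0 - NormalDist().cdf(z)


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Hypothetical coefficient z-scores for three features tested against one trait.
z_scores = {"feature_A": 3.2, "feature_B": 1.9, "feature_C": 0.4}
raw = [one_sided_p(z) for z in z_scores.values()]
adj = holm_adjust(raw)
significant = [name for name, p in zip(z_scores, adj) if p < 0.05]
print(significant)
```

In this made-up example, feature_B would pass an uncorrected threshold of 0.05 but does not survive the correction.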

Genetic contributions to DNA methylation variability

The most likely explanation for a genetic contribution to DNA methylation variability is a genetic polymorphism inside the VMR region. Roughly 25% of VMRs do not overlap any SNPs with a minor allele frequency (MAF) > 0.1 across our samples (Additional file 18: Table S17).
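As a reminder of what the MAF threshold means, the sketch below computes a minor allele frequency from diploid genotypes. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with missing calls given as None; that encoding, the function name, and the example genotypes are assumptions made for illustration.

```python
def minor_allele_frequency(genotypes):
    """MAF from diploid genotypes coded as alternate-allele counts (0, 1, 2).

    Missing calls are passed as None and skipped.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_freq, 1 - alt_freq)


# Hypothetical genotypes for one SNP across eight individuals (one missing).
genotypes = [0, 1, 0, 2, None, 0, 1, 0]
maf = minor_allele_frequency(genotypes)
passes_threshold = maf > 0.1
print(round(maf, 3), passes_threshold)
```

With sample sizes as small as those described here, a single heterozygous individual can move a SNP across the 0.1 threshold, so such counts are best read as approximate.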

Genotype data from all but two individuals were available from GTEx v8 [16] leaving 6–20 individuals per tissue, which is too few to conduct a rigorous methylation QTL analysis. However, we did identify several examples of VMRs that overlap one or more SNPs that are associated with altered methylation (Fig. 6a). One example is shown within the MYO3A gene which lies upstream of GAD2, a primary regulator of GABA synthesis that has been associated with schizophrenia [53], bipolar disorder [54], and major depression [55]. There is at least one GAD2 enhancer located within the MYO3A gene, though it is 150 kb away from this VMR [56]. To examine possible genetic contributions to our observed methylation variability, we asked if mQTLs previously identified from brain tissues were enriched in our VMRs (Fig. 6b). The mQTL data we chose were generated using 450k arrays on samples from the bulk hippocampus (n = 110) [51], fetal brain (n = 166) [6], and bulk temporal cortex (n = 44) that was also sorted into NeuN+ (n = 18) and NeuN− (n = 22) nuclei [52]. We found the greatest enrichments in all datasets within our ubiquitous VMRs consistent with our assertion that methylation variability in this class of VMRs is genetically driven. We also compared our VMRs to two existing datasets [38, 39] that examine genetic and environmental contributions to methylation variation. Garg et al. [38] profiled 58 NeuN-sorted frontal cortex samples using the Illumina 450k array, and identified 1136 neuronal VMRs, 996 of which were also detected in our analysis. In addition, they identified 149 VMRs in common among blood, brain, and fibroblast samples and while 142 of these were also identified in our analysis, only 18 were present in our set of ubiquitous VMRs. Gunasekara et al. [39] profiled DNA methylation in three tissues from 10 GTEx donors to identify regions where interindividual methylation variation was not tissue-specific. They identified 9926 correlated regions of systemic interindividual variation (CoRSIVs). We found that our VMRs were enriched for VMRs identified by Garg et al. [38] as well as for CoRSIVs [39] (Fig. 6c). We detected 16% (1588/9926) of CoRSIVs in our analysis. As CoRSIVs are, by definition, consistent across the three tissues sampled (heart, thyroid, and cerebellum), we were unsurprised to find that the enrichment in these regions was lower for tissue-specific VMRs.
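The overlap percentages quoted above follow directly from the stated counts, and a log2 enrichment of the kind plotted in a heatmap can be sketched as a log2 odds ratio from a 2×2 table. The text does not specify the exact enrichment statistic, so the odds-ratio form is an assumption, as are the function name and the 2×2 counts; only the overlap counts come from the text.

```python
from math import log2

# Overlap counts as quoted in the text: (detected in this analysis, total reported).
overlaps = {
    "CoRSIVs": (1588, 9926),
    "Garg neuronal VMRs": (996, 1136),
    "Garg common VMRs": (142, 149),
}
percent_detected = {name: round(100 * hit / total) for name, (hit, total) in overlaps.items()}
print(percent_detected)


def log2_enrichment(in_hit: int, in_miss: int, out_hit: int, out_miss: int) -> float:
    """log2 odds ratio: odds of a hit inside the regions vs. outside them."""
    if min(in_hit, in_miss, out_hit, out_miss) <= 0:
        raise ValueError("all four cells must be positive")
    return log2((in_hit / in_miss) / (out_hit / out_miss))


# Hypothetical 2x2 table: mQTL CpGs vs. other CpGs, inside vs. outside VMRs.
example = log2_enrichment(in_hit=200, in_miss=800, out_hit=1000, out_miss=16000)
print(example)
```

Positive values indicate enrichment inside the regions and negative values indicate depletion.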

Fig. 6 Genetic contributions to methylation variability. a Example of a SNP associated with altered methylation within MYO3A showing average methylation values for each genotype (indicated) for all tissues. The VMRs associated with this SNP are shaded pink and the SNP is indicated. b Heatmap representing log2 enrichment of CpGs within VMRs and tissue-specific VMRs compared to the rest of the genome for previously published mQTL datasets [6, 51, 52], neuronal VMRs [35], and correlated regions of systemic interindividual variations (CoRSIVs) [36]. Only significant enrichments/depletions are shown; non-significant combinations are shaded gray.


Can the value of heritability be greater than 1? - Biology

"Heritability" is a term used in many articles and through much of the scientific literature and invariably promotes the idea that it relates specifically to inherited traits. As a result, it is often assumed that the heritability of a particular trait relates to how much influence genetics has on the trait manifesting in an individual.

However, that isn't what it means.

Heritability attempts to address the relationship between nature (genetics) and nurture (environment), so that as each changes, the variation between individuals within a population can be estimated based on these influences. In this context, "environment" simply represents everything external to the genome that could affect expression.

Therefore the first significant aspect of heritability that must be understood is that it tells us nothing about individuals. It is strictly an estimate of the variations that occur within populations. If heritability is applied to an individual it is a meaningless concept [since an individual cannot be said to vary with anything].

It also doesn't tell us anything about the specific influence of genes on any particular trait, since that would be the result of inheritance. We also need to understand that a trait is something that is "selectable". In other words, there exists a possibility that outcomes can vary in the expression of a particular trait. This follows from the Mendelian view of inheritance where genes are represented as two alleles [dominant and recessive], so that particular combinations would produce certain outcomes. Therefore if there is no variation in the alleles, then everyone has the same genes and heritability would be zero. Adaptations like having a heart or a stomach are not selectable (too many genes and interactions) and therefore tell us nothing about heritability. The primary difference is that adaptations represent the cumulative effect of changes over time that have gone to fixation in a population. As a result, there is no "selection" that would determine "heart or no heart". Therefore we can consider that the heart is an adaptation, while the risk of heart disease is a trait.

If we assume that both genetics and the environment influence the traits present in an organism, then we must account for which is the greater influence on differences between organisms (1).

So let's conduct a thought experiment to illustrate what is being evaluated.

Assume that we can create an environment that is identical in every aspect for a particular population of organisms. They develop and grow and as they reach adulthood we observe differences in the traits that they manifest. Since the environment exerts an identical influence on the organisms, then we can rule out the environment as being a factor [in other words, it will affect them all equally]. From this we can conclude that whatever differences exist between organisms [variance] must be the result of genes alone.

In this case, the heritability would be 1.0 or 100%, indicating that only the genes are responsible for the variation we see between individuals.

Similarly we could conduct another thought experiment where we take individuals that are genetically identical [clones] and subject them to various environments. When we examine the traits, whatever variation exists cannot be due to the genes [because they are identical] therefore, the variation is solely affected by the environment. In this case the heritability would be 0.0 or 0% (2).
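The two thought experiments can be written down as a small calculation. This sketch uses the decomposition VP = VG + VE given earlier on this page and assumes no gene-environment interaction; the function name and the variance values are made up for illustration.

```python
def heritability_from_variances(v_g: float, v_e: float) -> float:
    """Heritability as V_G / V_P, with V_P = V_G + V_E."""
    if v_g < 0 or v_e < 0:
        raise ValueError("variances cannot be negative")
    v_p = v_g + v_e
    if v_p == 0:
        raise ValueError("no phenotypic variance: heritability is undefined")
    return v_g / v_p


# First thought experiment: identical environment, so V_E = 0.
identical_environment = heritability_from_variances(v_g=4.0, v_e=0.0)
# Second thought experiment: clones, so V_G = 0.
clones = heritability_from_variances(v_g=0.0, v_e=4.0)
# An intermediate case with both sources of variation.
mixed = heritability_from_variances(v_g=9.0, v_e=1.0)
print(identical_environment, clones, mixed)
```

Because both variances are non-negative, this ratio cannot leave the range from 0 to 1. Estimates computed from sample data are a separate matter.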

However, what makes this concept a bit difficult is when we consider actual numbers within this context. For example, let's consider that height may have a heritability of 0.90. What this means is that while there is an influence exerted by environmental factors (such as nutrients) in the height of an individual, the major portion of the influence is exerted by the genes. More importantly, it really tells us that the major influence in explaining the DIFFERENCES between individuals is accounted for in this fashion.

Note that it tells us nothing about what gave rise to the particular height for any particular individual, but rather what explains the differences between individuals within a particular population.

Looking at the plants in the diagram, we have a uniform environment on the left with nutrients and lighting, so the variation in height is due solely to genes (heritability = 100%). Using the same seeds, we change the environment to a poor nutrient level, and this time we have stunted plants whose variation in height is still due entirely to the genes, since the environment is still constant within the group.
This gives rise to the condition that "we can have total heritability within groups, substantial variation between groups, but no genetic difference between the groups". In both cases, the environment is essentially held constant, so the variation in height is solely due to the genes, hence 100% heritability. However, if we now combine both scenarios into a single population, so that individuals also differ in environment, the heritability falls below 100% (for example, to the 0.90 value mentioned previously), because the variation is no longer accounted for solely by the genes but also by the environmental effects that produce the stunting.
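The plant example can be checked numerically. In this sketch the genetic deviations, the base height, and the size of the environmental shift are all hypothetical, chosen so that the pooled population reproduces the 0.90 figure used in the text. A different environmental gap would give a different pooled value.

```python
from statistics import pvariance

# Hypothetical genetic deviations in height (cm); the same seeds in both groups.
genetic = [-3, 3, -3, 3]
base_height = 50
rich_effect, poor_effect = 1, -1  # assumed environmental shifts (cm)

rich = [base_height + rich_effect + g for g in genetic]
poor = [base_height + poor_effect + g for g in genetic]


def heritability(genetic_values, phenotypes):
    """Ratio of genetic variance to total phenotypic variance."""
    return pvariance(genetic_values) / pvariance(phenotypes)


h2_rich = heritability(genetic, rich)                     # environment constant
h2_poor = heritability(genetic, poor)                     # environment constant
h2_pooled = heritability(genetic + genetic, rich + poor)  # environments differ
print(h2_rich, h2_poor, h2_pooled)
```

Within each group the ratio is 1, while in the pooled population it drops to 0.9, even though the two groups carry exactly the same genotypes.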

What this particular designation provides is a means by which we can begin to estimate or assess the influences that may be present in particular traits. So while genes are responsible for the expression of a trait, heritability is used to determine how specific such an expression is to the genes alone.

As a result, heritability is often used in artificial selection to establish which traits are the most likely to be successfully selected for. The higher the heritability of a trait, the more influence one has in obtaining that trait by selecting the best breeding pair.
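The link between heritability and artificial selection is usually expressed with the breeder's equation, R = h² × S, where S is how far the selected parents deviate from the population mean and R is the expected shift in the offspring mean. The equation is not stated in the text above and strictly applies to narrow-sense (additive) heritability; the function name and numbers below are illustrative assumptions.

```python
def response_to_selection(h2: float, selection_differential: float) -> float:
    """Breeder's equation: expected response R = h^2 * S."""
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("h2 must lie between 0 and 1")
    return h2 * selection_differential


# Assumed: the selected parents average 2 units above the population mean.
s = 2.0
high_heritability = response_to_selection(0.9, s)
low_heritability = response_to_selection(0.3, s)
print(high_heritability, low_heritability)
```

For the same choice of parents, the trait with the higher heritability is expected to shift three times as far in the next generation.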

One difficulty that arises with heritability is that any considered trait must be demonstrably linked to genetic transmission. This can become problematic when heritability is used to evaluate behavioral traits where the genetic link may be tenuous. In an effort to measure heritability, there is often a reliance on twin studies under the assumption that differences between identical twins must be attributable to the environment, since they are effectively genetically identical. However, as previously mentioned, this can result in difficult interpretations when the traits in question are purely behavioral. Until such time as behavioral traits can be explicitly linked to genes, any statement regarding heritability must be considered suspect.

============================================
(1) Heritability is ultimately a proportion which is expressed as a value between 0 and 1. As a result, we can find heritability numbers that indicate 0.30 or 0.60 or some such proportion to indicate the influence of the genes in the variability of a particular trait. In short, heritability can be defined as the ratio of variance due to genes to total variance in a population.
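This also bears on the question in the page title. The ratio of variances is bounded between 0 and 1, but an estimate computed from sample data need not be. The sketch below applies the single-parent regression estimator quoted in the question at the top of this page, 2 × Cov(x, y) / Var(x), to a small hypothetical data set; the data and function names are invented for illustration.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def h2_single_parent(parent, offspring):
    """Estimate heritability as 2 * Cov(parent, offspring) / Var(parent)."""
    return 2 * covariance(parent, offspring) / covariance(parent, parent)


# Hypothetical trait values for five parent-offspring pairs.
parent = [1, 2, 3, 4, 5]
offspring = [2, 2, 3, 5, 5]
estimate = h2_single_parent(parent, offspring)
print(estimate)
```

Here the regression slope is about 0.9, so the estimate is about 1.8. A value above 1 does not mean that more than all of the variance is genetic. It usually points to sampling error in a small sample, or to parents and offspring sharing an environment, which inflates their covariance.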

It is given that genes are responsible for the trait itself. Heritability attempts to establish the influence of nature (genes) versus nurture (environment) on the variability in a population. Therefore, for a trait with a heritability of 0.30, about 30% of the variation in the population can be attributed to genetic differences and about 70% to environmental differences. In addition, such traits must actually be selectable.

(2) Heritability is zero if there is no variation in the genes [so there can be no variation in their expression due to genes], whether because the genes have gone to fixation or because the individuals are genetically identical, as in identical twins or clones.

