Why isn't everything normally distributed?

Adult heights follow a Gaussian, a.k.a. normal, distribution [1]. The usual explanation is that many factors go into determining one’s height, and the net effect of many separate causes is approximately normal because of the central limit theorem.

If that’s the case, why aren’t more phenomena normally distributed? Someone asked me this morning specifically about phenotypes with many genetic inputs.

The central limit theorem says that the sum of many independent, additive effects is approximately normally distributed [2]. Genes are more digital than analog, and do not produce independent, additive effects. For example, the effects of dominant and recessive genes act more like max and min than addition. Genes do not appear independently—if you have some genes, you’re more likely to have certain other genes—nor do they act independently—some genes determine how other genes are expressed.
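To see how much the "additive" part matters, here is a small simulation sketch. It is only an illustration, not a model of genetics: each hypothetical effect is an independent uniform(0, 1) draw, and the number of effects and the sample size are arbitrary choices. Summing the effects gives a nearly symmetric result, while taking their maximum (a crude stand-in for dominance) gives a strongly skewed one.

```python
import random
import statistics


def skewness(xs):
    """Sample skewness: third central moment divided by the cube of the sd."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)


rng = random.Random(0)           # fixed seed so the run is repeatable
n_effects, n_people = 50, 10_000  # arbitrary illustrative sizes

sums, maxes = [], []
for _ in range(n_people):
    # Hypothetical independent effects, uniform on (0, 1).
    effects = [rng.random() for _ in range(n_effects)]
    sums.append(sum(effects))    # additive combination
    maxes.append(max(effects))   # "max-like" combination

sum_skew = skewness(sums)
max_skew = skewness(maxes)
print(f"skewness of sums:   {sum_skew:+.2f}")
print(f"skewness of maxima: {max_skew:+.2f}")
```

A normal distribution has skewness zero, so the sums look plausible as normal while the maxima clearly do not.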

Height is influenced by environmental effects, such as nutrition, as well as genetic effects, and these environmental effects may be more additive or independent than genetic effects.

Incidentally, if effects are independent but multiplicative rather than additive, the result may be approximately log-normal rather than normal.
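The reason is that the log of a product is a sum of logs, so the central limit theorem applies on the log scale. A rough simulation sketch, assuming 50 independent positive factors drawn uniformly from (0.8, 1.25) (both numbers are arbitrary choices for illustration):

```python
import math
import random
import statistics


def skewness(xs):
    """Sample skewness: third central moment divided by the cube of the sd."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * sd ** 3)


rng = random.Random(1)             # fixed seed so the run is repeatable
n_factors, n_samples = 50, 10_000  # arbitrary illustrative sizes

products = []
for _ in range(n_samples):
    p = 1.0
    for _ in range(n_factors):
        p *= rng.uniform(0.8, 1.25)  # hypothetical multiplicative effect
    products.append(p)

# Taking logs turns the product into a sum of independent terms.
logs = [math.log(p) for p in products]

prod_skew = skewness(products)
log_skew = skewness(logs)
print(f"skewness of products:      {prod_skew:+.2f}")
print(f"skewness of log(products): {log_skew:+.2f}")
```

The products are noticeably right-skewed, while their logs are close to symmetric, which is what a log-normal distribution would look like.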

* * *

Fine print:

[1] Men’s heights follow a normal distribution, and so do women’s. Adults not sorted by sex follow a mixture distribution as described here and so the distribution is flatter on top than a normal. It gets even more complicated when you consider that there are slightly more women than men in the world. And as with many phenomena, the normal distribution is a better description near the middle than at the extremes.
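A quick numerical sketch of the "flatter on top" claim. The means and standard deviation below are made-up round numbers in centimeters, not data, and the mixture is assumed to be exactly 50/50. The code compares the mixture density at its center with that of a single normal having the same mean and variance, and also computes the excess kurtosis, which is one rough summary of shape.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Illustrative assumptions only: equal weights, equal sd, means 13 cm apart.
men_mean, women_mean, sd = 176.0, 163.0, 7.0

mix_mean = 0.5 * (men_mean + women_mean)
d = 0.5 * (men_mean - women_mean)   # distance of each mean from the center
mix_sd = math.sqrt(sd ** 2 + d ** 2)  # sd of the 50/50 mixture

# Density at the center: mixture vs. a single normal with matching moments.
mix_center = 0.5 * normal_pdf(mix_mean, men_mean, sd) \
    + 0.5 * normal_pdf(mix_mean, women_mean, sd)
normal_center = normal_pdf(mix_mean, mix_mean, mix_sd)

# Closed form for an equal-weight, equal-sd mixture of two normals.
excess_kurtosis = -2.0 * d ** 4 / (sd ** 2 + d ** 2) ** 2

print(f"mixture density at center: {mix_center:.4f}")
print(f"matched normal at center:  {normal_center:.4f}")
print(f"excess kurtosis:           {excess_kurtosis:+.3f}")
```

Under these assumptions the mixture is lower in the middle than the matched normal, and its excess kurtosis is negative.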

[2] There are many variations on the central limit theorem. The classical CLT requires that the random variables in the sum be identically distributed as well, though that isn’t so important here.

18 thoughts on “Why isn’t everything normally distributed?”

Dave

9 March 2015 at 14:42

Am I wrong to be bothered by the description of heights as being normally distributed when they cannot take on negative values? Or is 0 so many standard deviations from the mean that we would be unlikely to see one (given all the people born throughout history), even if it were possible?

John

9 March 2015 at 14:53

Dave, that’s an example of where the normal approximation breaks down in the extremes. Though as you said, 0 is many standard deviations away from the mean.

As you travel from the mean out into the tails, the first problem you encounter is not that the probability of negative heights is over-estimated, but that the probability of extremely short and extremely tall people is under-estimated. These are rare events, so the absolute error is very small, but the relative error far out in the tail is huge. The normal approximation is perfectly adequate for, say, airlines wanting to estimate how many passengers will have to stoop when entering a plane.

Damien

9 March 2015 at 17:29

For those that are interested in the biological as well as the statistical details, Wood et al. (2014) is the largest genetic study of height thus far (N = 253,288). The main take home message is that “[t]he results are consistent with a genetic architecture for human height that is characterized by a very large but finite number (thousands) of causal variants, located throughout the genome but clustered in both a biological and genomic manner.”

(Disclosure: I am a middle-author on this paper.)

Phgrosjean

10 March 2015 at 03:18

It is a common idea that the height of human beings is fitted by a Normal distribution. However, if you think about it a little more:
1) Height does not accept negative values, as Dave pointed out. For me, this is the primary warning sign that the distribution is probably not Normal.

2) I wonder why so few people have tried to fit a Log-Normal to such data as well: it works equally well. A Log-Normal distribution does not always look completely asymmetrical, especially when the mean >> sd. See, for instance, the legend of Fig. 1 in Limpert et al 2001 (http://bioscience.oxfordjournals.org/content/51/5/341.extract).

3) Growth is essentially a multiplicative phenomenon. This has been well known since the work of Von Bertalanffy and others. Most, if not all, serious growth curves are exponential in nature for this reason.

So, given those three arguments, why do people still believe the height of humans is a good example of a Normally-distributed variable? I do not think it is; I tend to favour the Log-Normal distribution in this case.
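Point 2 above can be made concrete. The skewness of a log-normal distribution depends only on the ratio sd/mean, so when the mean is much larger than the sd the curve is nearly symmetric. A small sketch follows; the mean of 175 and sd of 7 are illustrative round numbers, not measurements.

```python
import math


def lognormal_skewness(mean, sd):
    """Skewness of a log-normal distribution with the given mean and sd."""
    cv2 = (sd / mean) ** 2        # squared coefficient of variation
    sigma2 = math.log(1.0 + cv2)  # variance of the underlying normal
    return (math.exp(sigma2) + 2.0) * math.sqrt(math.exp(sigma2) - 1.0)


# Same mean, increasingly large sd relative to the mean.
for mean, sd in [(175.0, 7.0), (175.0, 35.0), (175.0, 175.0)]:
    print(f"sd/mean = {sd / mean:.2f}: skewness = {lognormal_skewness(mean, sd):.2f}")
```

With sd/mean around 0.04 the skewness is small, which is why a normal and a log-normal can be hard to tell apart on data like this.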

Joseph Levy

10 March 2015 at 03:50

Just want to note that independence is a sufficient condition for the normal approximation to hold, but it is not a necessary condition. Sums (or averages) of dependent variables can be approximately normally distributed.
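One simple illustration of dependent terms whose sums still look normal is a moving average, where neighboring terms share a common random shock. This is only a sketch of one weakly dependent case, with uniform shocks and sizes chosen arbitrarily; it says nothing about strongly dependent variables.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable


def dependent_terms(n):
    """X_i = e_i + e_{i+1}: adjacent terms share a shock, so they are correlated."""
    shocks = [rng.uniform(-1.0, 1.0) for _ in range(n + 1)]
    return [shocks[i] + shocks[i + 1] for i in range(n)]


def standardized_moment(xs, k):
    """k-th central moment divided by sd**k."""
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - m) ** k for x in xs) / (len(xs) * sd ** k)


# Check that neighboring terms really are correlated (theory says 0.5).
series = dependent_terms(20_000)
m = statistics.fmean(series)
lag1_corr = sum((a - m) * (b - m) for a, b in zip(series, series[1:])) \
    / sum((a - m) ** 2 for a in series)

# Distribution of sums of 100 dependent terms.
sums = [sum(dependent_terms(100)) for _ in range(5_000)]
sum_skew = standardized_moment(sums, 3)
sum_excess_kurt = standardized_moment(sums, 4) - 3.0

print(f"lag-1 correlation of terms: {lag1_corr:.2f}")
print(f"skewness of sums:           {sum_skew:+.2f}")
print(f"excess kurtosis of sums:    {sum_excess_kurt:+.2f}")
```

Both shape measures come out near zero, as they would for a normal distribution, even though the terms are clearly not independent.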

John

10 March 2015 at 06:23

Negative values are over 21 standard deviations from the mean, and so are astronomically unlikely, less than 10^-100. By comparison, there are about 10^80 particles in the observable universe. If that were the only inaccuracy of the normal approximation, no other probability model would fit reality so well.
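For anyone who wants to check the order of magnitude, the lower tail of a standard normal can be computed with the complementary error function. This is a sketch using only the Python standard library; the particular cutoffs are just sample points.

```python
import math


def normal_lower_tail(z):
    """P(Z < -z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


for z in (21.0, 21.5, 22.0):
    print(f"{z:4.1f} sd below the mean: {normal_lower_tail(z):.2e}")
```

By this calculation the tail at exactly 21 standard deviations is on the order of 10^-98, and it falls below 10^-100 a little past that, so the bound holds once "over 21" means about 21.3 or more.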

Arturo Erdely

6 January 2016 at 11:11

“…additive effects is approximately normally distributed” NOT true if the variance is not finite, should be added I think.
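The standard example of this is the Cauchy distribution, which has no finite variance: the average of many Cauchy draws is just as spread out as a single draw. A simulation sketch, with the sample sizes chosen arbitrarily and the interquartile range used as the measure of spread because the variance does not exist:

```python
import math
import random
import statistics

rng = random.Random(3)  # fixed seed so the run is repeatable


def cauchy():
    """Standard Cauchy draw via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr(xs):
    """Interquartile range of a sample."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1


n_reps, n_per_mean = 2_000, 500  # arbitrary illustrative sizes

single_draws = [cauchy() for _ in range(n_reps)]
sample_means = [
    statistics.fmean([cauchy() for _ in range(n_per_mean)])
    for _ in range(n_reps)
]

iqr_single = iqr(single_draws)
iqr_means = iqr(sample_means)
print(f"IQR of single draws:      {iqr_single:.2f}")
print(f"IQR of means of {n_per_mean} draws: {iqr_means:.2f}")
```

For a finite-variance distribution the second number would be much smaller than the first. Here the two are about the same, so averaging does not pull the result toward a normal.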

S Ellison

6 January 2016 at 18:54

Personally, I don’t find it surprising that not everything is normally distributed. Why should any real phenomenon follow a theoretical limiting distribution anyway, never mind a symmetric, infinite-tailed distribution that is exact only in an unachievable limit? The surprise is that so many things _are_ sufficiently near normality for it to be useful!

John

6 January 2016 at 19:56

S Ellison: I agree. The most astonishing thing is that so many things are normal. But once you hear a justification for that via the Central Limit Theorem, the next question is “Then why isn’t everything normal?”

Glen b

20 July 2017 at 21:25

If you look at even one standard deviation from the mean you can see height is not actually symmetric, let alone normal. There’s a big difference between asserting an adequate approximation for some particular kinds of purpose (estimating the proportion of people having to stoop, where a rough approximation at some particular quantile will do fine) and actually *asserting* normality (which you did but which is clearly not the case). The claim should be more consistent with the evidence (e.g. that it’s approximately normal, or sufficiently normal for most common purposes).

Matt

20 July 2017 at 21:44

I don’t like the statement “Genes are more digital than analog”. A hypothetical normal distribution generated by a single gene in a population is caused by the joint interaction of genetics and environment.

Job van der Zwan

21 July 2017 at 07:10

This makes me wonder if in countries where these environmental effects are closer to identical for everyone (for example, everyone having access to good nutrition, or at least everyone (not) eating the things that have the most impact on height), the genetic factors become more dominant, causing this normal distribution to break down.

Is height normally distributed in the tallest countries in the world? What about the shortest?

Jonathan

21 July 2017 at 09:37

Genes are “digital” but their effect on polygenic traits like height is usually modeled as additive random effects.

John

21 July 2017 at 10:10

Everything depends on context. I wrote a pair of consecutive blog posts, why heights are normally distributed and why they are not. It’s informal language to say that anything is normally distributed. Saying that something is normally distributed means that the normal distribution is an adequate approximation in that context.

carl anderson

28 March 2018 at 11:06

This is a good paper discussing the relative prevalence and causes of lognormal distributions in natural systems: Log-normal Distributions across the Sciences: Keys and Clues
http://stat.ethz.ch/~stahel/lognormal/bioscience.pdf

Allan Dobbins

25 May 2021 at 14:23

CLT requires IID and finite variance I believe.
Distributions that involve time can’t take on negative values. Still it is marvelous that suitably normalized sums of any IID finite-variance RVs approach a Gaussian distribution for large N. To provide students an intuition pump on this it helps to show them what a sum really means: show them the distribution of the sum of two dice, for example, which can lead into convolution, Fourier transforms, …
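The dice example is easy to work out exactly. Here is a minimal sketch: the distribution of a sum of independent variables is the convolution of their distributions, written here with plain dictionaries and exact fractions. The helper name `convolve` is just for this illustration.

```python
from fractions import Fraction


def convolve(p, q):
    """PMF of X + Y for independent X, Y given as dicts mapping value -> probability."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0) + pa * qb
    return out


die = {face: Fraction(1, 6) for face in range(1, 7)}  # one fair die
two_dice = convolve(die, die)             # triangular shape
four_dice = convolve(two_dice, two_dice)  # already looks bell-shaped

for total in sorted(four_dice):
    # Crude text histogram: one '#' per 1/1296 of probability, scaled down.
    bar = "#" * int(four_dice[total] * 1296 // 8)
    print(f"{total:2d} {bar}")
```

One die is flat, two dice give a triangle, and by four dice the histogram already has the familiar bell shape.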

John

25 May 2021 at 14:29

The classic version of the CLT requires random variables to be independent and identically distributed, but there are generalizations that weaken both of these assumptions.

Peymon

25 October 2022 at 14:30

Another way to look at it:

Many phenomena in nature follow a normal distribution because entropy (informally speaking, randomness) is maximized by this distribution among those with a given mean and variance. If there is no external bias, most natural phenomena are entirely random.
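The maximum entropy property holds when the variance is held fixed. As a small check, here are the differential entropies (in nats) of three distributions scaled to the same variance, using their standard closed-form expressions. The choice of the uniform and Laplace distributions as comparisons is arbitrary.

```python
import math

sigma = 1.0  # common standard deviation for all three distributions

# Normal with sd sigma: 0.5 * ln(2 * pi * e * sigma^2)
h_normal = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

# Uniform with width w has variance w^2 / 12 and entropy ln(w).
h_uniform = math.log(sigma * math.sqrt(12.0))

# Laplace with scale b has variance 2 * b^2 and entropy 1 + ln(2 * b).
h_laplace = 1.0 + math.log(2.0 * sigma / math.sqrt(2.0))

print(f"normal:  {h_normal:.4f}")
print(f"laplace: {h_laplace:.4f}")
print(f"uniform: {h_uniform:.4f}")
```

The normal comes out on top, as the maximum entropy argument says it must for a fixed variance.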
