I think it's true to say that most of us, when we are young, dream of growing up to become statisticians. Sadly, very few of us chase that dream, and instead we grow up to become, say, premiership footballers or Grand Prix drivers. Nevertheless, there are a select few who persist, and who consequently spend their working lives looking at random variables, probability density functions, and cumulative distribution functions. Be still, my beating heart.
Needless to say, statistics doesn't really do it for me. The problem with statistics, as a branch of mathematics, is that its value derives largely from its practical utility, not the intrinsic interest of its mathematical structures. Whilst statistics has vital application in all branches of science, finance and commerce, it is precisely this universality and utility which divest it of intrinsic interest. Statistics performs a service for other sciences, rather than being of value in its own right.
Nevertheless, any subject can be interesting once you tunnel sufficiently deeply, so consider the following well-known statistical scientific truism:
If a number of independent random variables combine in an additive fashion, then the collective result is a normal distribution (i.e., a bell-shaped, Gaussian distribution).
If a number of independent random variables combine in a multiplicative fashion, then the collective result is a lognormal distribution (i.e., the distribution of a variable whose logarithm is normally distributed).
Now, I'm not sure whether this common wisdom is really correct. As far as I can make out, the first clause is a simplified statement of the central limit theorem. This asserts that the suitably normalised sum of n independent and identically distributed random variables, each with finite variance, converges in distribution to a normal distribution as n tends to infinity. The central limit theorem has one particularly important implication for measurement science and the estimation of measurement error:
Suppose that the variable to be measured has an arbitrary distribution (with finite variance), and suppose that one measures the value of the variable by taking a collection of samples, each sample consisting of n measurements; if one calculates the mean value from each sample, then the distribution of sample means will converge to a normal distribution, centred upon the mean value of the measured variable, as the size of the sample, n, tends to infinity. Hence, whatever the distribution of the variable being measured, whether it is normal or not, the collection of sample means will have an approximately normal distribution when n is large. This is crucial, because it enables one to estimate the 95% or 99% confidence interval (the uncertainty or measurement error) in a measurement estimate, using the simple formulae or tables of values associated with the normal distribution.
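The claim about sample means is easy to poke at numerically. The sketch below is only an illustration, not a proof: it assumes the measured variable follows an exponential distribution with rate 1 (my choice, purely because it is markedly skewed), and the helper names `skewness` and `sample_means` are made up for the purpose. If the central limit theorem is doing its job, the skewness of the sample means should shrink towards zero as n grows, while their average stays close to the true mean of 1.

```python
import math
import random


def skewness(xs):
    """Sample skewness: the third standardised moment (0 for a symmetric distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n


def sample_means(n, num_samples, rng):
    """Means of num_samples samples, each made of n draws from Exponential(rate=1)."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(num_samples)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    raw = [rng.expovariate(1.0) for _ in range(20000)]
    print("skewness of raw measurements:", round(skewness(raw), 2))
    for n in (2, 10, 50):
        means = sample_means(n, 4000, rng)
        average = sum(means) / len(means)
        print("n =", n, "average of means:", round(average, 3),
              "skewness of means:", round(skewness(means), 2))
```

Skewness is only one crude yardstick of normality, of course; a histogram or a quantile plot of the means would tell the same story more fully.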
So far, so good. But in general, will a sum of independent random variables give a normal distribution? The central limit theorem doesn't entail that it will, for the central limit theorem requires the contributing random variables to be identically distributed. So what happens when a sum of independent random variables with different distributions is taken?
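One way to get a feel for that question is simply to try it. The following is a rough sketch under assumptions that are entirely mine: the sum is made of 60 independent terms drawn from a mixture of uniform, exponential and triangular distributions with differing parameters, none of which dominates the total variance. It then compares the fraction of sums lying within one and two standard deviations of the mean with the corresponding fractions for a normal distribution. A single experiment like this settles nothing in general, and a sum dominated by one wildly variable term could behave quite differently.

```python
import math
import random


def draw_sum(rng, num_terms=60):
    """Sum of independent terms that are deliberately NOT identically distributed."""
    total = 0.0
    for k in range(num_terms):
        if k % 3 == 0:
            total += rng.uniform(0.0, 1.0 + k / 10.0)   # uniforms of growing width
        elif k % 3 == 1:
            total += rng.expovariate(1.0 + k / 20.0)    # exponentials of growing rate
        else:
            total += rng.triangular(0.0, 2.0, 0.2)      # a skewed triangular term
    return total


def fraction_within(sums, num_sd):
    """Fraction of values lying within num_sd standard deviations of their mean."""
    n = len(sums)
    mean = sum(sums) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in sums) / n)
    return sum(1 for s in sums if abs(s - mean) <= num_sd * sd) / n


if __name__ == "__main__":
    rng = random.Random(0)
    sums = [draw_sum(rng) for _ in range(5000)]
    for k in (1, 2):
        normal_fraction = math.erf(k / math.sqrt(2.0))  # same fraction for a normal
        print("within", k, "sd:", round(fraction_within(sums, k), 3),
              "normal reference:", round(normal_fraction, 3))
```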
The second assertion, that a lognormal distribution results from random variables combining in a multiplicative fashion, also needs to be tightly qualified. The assertion is largely based upon the following property of the logarithmic function:
log(A × B) = log A + log B
Thus, given a collection of independent, positive random variables with identical distributions, the logarithm of their product is a sum of independent, identically distributed terms, so (applying the central limit theorem again) the product itself will possess a distribution well-approximated by a lognormal distribution. However, what if the variables in the collection are not identically distributed?
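The identically distributed case, at least, can be checked with a few lines of simulation. In the sketch below the factors are assumed to be uniform on (0.5, 1.5), which is an arbitrary choice of mine to keep them positive, and `draw_product` and `skewness` are illustrative helper names. The product itself should come out heavily right-skewed, whereas the logarithm of the product, being a sum of 30 independent terms, should be close to symmetric.

```python
import math
import random


def skewness(xs):
    """Sample skewness: the third standardised moment (0 for a symmetric distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sum(((x - mean) / sd) ** 3 for x in xs) / n


def draw_product(rng, num_factors=30):
    """Product of independent, identically distributed positive factors."""
    product = 1.0
    for _ in range(num_factors):
        product *= rng.uniform(0.5, 1.5)  # assumed factor distribution
    return product


if __name__ == "__main__":
    rng = random.Random(0)
    products = [draw_product(rng) for _ in range(4000)]
    logs = [math.log(p) for p in products]  # log of a product = sum of logs
    print("skewness of the products:     ", round(skewness(products), 2))
    print("skewness of log(the products):", round(skewness(logs), 2))
```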
Moreover, it has been noted that if the number of steps in a multiplicative process is itself subject to a statistical distribution, then the result will not necessarily be a lognormal distribution. For example, if the number of steps in a multiplicative process is subject to a geometric distribution (a discrete version of the exponential distribution), then whilst the body of the distribution will be lognormal, the tails will exhibit power law behaviour. That's interesting.
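A hedged numerical sketch of this effect: assume each multiplicative step is itself lognormal, so that the logarithm of the product is a sum of standard normal terms (my assumption, chosen so that a fixed number of steps gives an exactly lognormal product). The code compares a fixed 10 steps against a geometrically distributed number of steps with mean 10, using the excess kurtosis of the log-product as a crude measure of tail weight. A value near zero is what a normal distribution gives; a clearly positive value means heavier tails in the logarithm, which is consistent with, though certainly not proof of, power-law tails in the product itself.

```python
import math
import random


def excess_kurtosis(xs):
    """Fourth standardised moment minus 3 (0 for a normal distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    fourth = sum((x - mean) ** 4 for x in xs) / n
    return fourth / var ** 2 - 3.0


def geometric(rng, p):
    """Geometric variate on 1, 2, 3, ... (trials up to the first success), by inversion."""
    return int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1


def log_product(rng, num_steps):
    """Log of a product of num_steps lognormal factors = sum of normal terms."""
    return sum(rng.gauss(0.0, 1.0) for _ in range(num_steps))


if __name__ == "__main__":
    rng = random.Random(0)
    fixed = [log_product(rng, 10) for _ in range(20000)]
    geo = [log_product(rng, geometric(rng, 0.1)) for _ in range(20000)]
    print("excess kurtosis, fixed 10 steps:      ", round(excess_kurtosis(fixed), 2))
    print("excess kurtosis, geometric step count:", round(excess_kurtosis(geo), 2))
```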
In answer to your first question, there are versions of the CLT with weaker conditions. For example, take a look at Lyapunov's CLT, which requires independence but does not require the variables to be identically distributed, replacing this with a moment condition which essentially says that no individual random variable is allowed to dominate the sum.
i laughed out loud at your first line, wonderful - i did actually dream of mastering theoretical statistics when i was about 19, as it was the exam i failed in my Psychology BSc course, and i tried to teach myself Statistics from a 500-page textbook for the retake. i put in about 6 hours a day over the 3 months leading to the exam (and my brain really doesn't like science or numbers, so this was pretty tough), then got there to find the exam only required me to know how to use SPSS, the computer programme. i knew how to do these tests with a calculator, not how to instruct SPSS to do them for me. So i failed for the second time and dropped out completely. But for a while i did actually dream of becoming an adept statistician. Needless to say i forgot almost everything i'd learnt a week after the exam.
Excellent work, Anon.
Did you get your BSc, elberry?
For maximum comic effect, your first line should maybe have read 'I think it's true to say that 87.6% of us, when we are young, dream of growing up to become statisticians.'
First rule of comedy, Spike...
No, luckily i failed the second time and became a bum for a while.