The Cauchy distribution’s counter-intuitive behavior

Someone with no exposure to probability or statistics likely has an intuitive sense that averaging random variables reduces variance, though they wouldn’t state it in those terms. They might, for example, agree that the average of several test grades gives a better assessment of a student than a single test grade. But data from a Cauchy distribution doesn’t behave this way.

Averages and scaling

If you have four independent random variables, each normally distributed with the same scale parameter σ, then their average is also normally distributed but with scale parameter σ/2.

If you have four independent random variables, each Cauchy distributed with the same scale parameter σ, then their average is also Cauchy distributed but with the exact same scale parameter σ.

So the normal distribution matches common intuition, but the Cauchy distribution does not.

In the case of random variables with a normal distribution, the scale parameter σ is also the standard deviation. In the case of random variables with a Cauchy distribution, the scale parameter σ is not the standard deviation because Cauchy random variables don’t have a variance, so they don’t have a standard deviation.
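Here's a quick simulation sketch (assuming NumPy) of the contrast above. Since Cauchy samples have no variance, the comparison uses the interquartile range as a measure of spread.

```python
import numpy as np

rng = np.random.default_rng(20240101)

def iqr(samples):
    """Interquartile range: a measure of spread that exists even when the variance doesn't."""
    q75, q25 = np.percentile(samples, [75, 25])
    return q75 - q25

n = 100_000  # number of groups of four observations

normal = rng.standard_normal((n, 4))
cauchy = rng.standard_cauchy((n, 4))

print("Normal, single observation:", iqr(normal[:, 0]))
print("Normal, average of four:   ", iqr(normal.mean(axis=1)))  # about half as large
print("Cauchy, single observation:", iqr(cauchy[:, 0]))
print("Cauchy, average of four:   ", iqr(cauchy.mean(axis=1)))  # about the same
```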

Modeling

Some people object that nothing really follows a Cauchy distribution because the Cauchy distribution has no mean or variance. But nothing really follows a normal distribution either. All probability distributions are idealizations. The question of any probability distribution is whether it adequately captures the aspect of reality it is being used to model.

Mean

Suppose some phenomenon appears to behave like it has a Cauchy distribution, with no mean. Alternately, suppose the phenomenon has a mean, but this mean is so variable that it is impossible to estimate. There’s no practical difference between the two.

Variance

And in the alternate case, suppose there is a finite variance, but the variance is so large that it is impossible to estimate. If you take the average of four observations, the result is still so variable that the variance is impossible to estimate. You’ve cut the theoretical variance by a factor of four, but that makes no difference. Again this is practically indistinguishable from a Cauchy distribution.

Truncating

Now suppose you want to tame the Cauchy distribution by throwing out samples with absolute value greater than M. The result is a truncated Cauchy distribution, which has a finite mean and variance.

But how do you choose M? If you don’t have an objective reason to choose a particular value of M, you would hope that your choice doesn’t matter too much. And that would be the case for a thin-tailed probability distribution like the normal, but it’s not true of the Cauchy distribution.

The variance of the truncated distribution will be approximately proportional to M, so by choosing M you choose the variance. So if you double your cutoff for outliers that are to be discarded, you approximately double the variance of what’s left. Your choice of M matters a great deal.
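Here is a small simulation sketch (assuming NumPy) of the dependence on M. For a standard Cauchy distribution truncated to [−M, M], the exact variance is (M − arctan M)/arctan M, which grows roughly like 2M/π.

```python
import numpy as np

rng = np.random.default_rng(20240101)
samples = rng.standard_cauchy(10_000_000)

# Empirical variance after discarding samples with absolute value greater than M.
# Doubling M roughly doubles the variance of what remains.
for M in [50, 100, 200, 400]:
    kept = samples[np.abs(samples) <= M]
    print(M, kept.var())
```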


Why do medical tests always have error rates?

Most people implicitly assume medical tests are infallible. If they test positive for X, they assume they have X. Or if they test negative for X, they’re confident they don’t have X. Neither is necessarily true.

Someone recently asked me why medical tests always have an error rate. It’s a good question.

A test is necessarily a proxy, a substitute for something else. You don’t run a test to determine whether someone has a gunshot wound: you just look.

A test for a disease is based on a pattern someone discovered. People who have a certain condition usually have a certain chemical in their blood, and people who do not have the condition usually do not have that chemical in their blood. But it’s not a direct observation of the disease.

“Usually” is as good as you can do in biology. It’s difficult to make universal statements. When I first started working with genetic data, I said something to a colleague about genetic sequences being made of A, C, G, and T. I was shocked when he replied “Usually. This is biology. Nothing always holds.” It turns out some viruses have Zs (aminoadenine) rather than As (adenine).

Error rates may be very low and still be misleading. A positive test for a rare disease is usually a false positive, even if the test is highly accurate. This is a canonical example of the need to use Bayes’ theorem. The details are written up here.
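Here's the arithmetic behind that claim, with illustrative numbers I've made up: a disease with 0.1% prevalence and a test with 99% sensitivity and 99% specificity.

```python
# Bayes' theorem for a rare disease with a highly accurate test.
prevalence = 0.001    # P(disease)
sensitivity = 0.99    # P(positive | disease)
specificity = 0.99    # P(negative | no disease)

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive

print(p_disease_given_positive)  # about 0.09: most positives are false positives
```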

The more often you test for something, the more often you’ll find it, correctly or incorrectly. The more conditions you test, the more conditions you find, correctly or incorrectly.

Wouldn’t it be great if your toilet had a built-in lab that tested you for hundreds of diseases several times a day? No, it would not! The result would be frequent false positives, leading to unnecessary anxiety and expense.

Up to this point we’ve discussed medical tests, but the same applies to any kind of test. Surveillance is a kind of test, one that also has false positives and false negatives. The more often Big Brother reads your email, the more likely it is that he will find something justifying a knock on your door.


Do incremental improvements add, multiply, or something else?

Suppose you make an x% improvement followed by a y% improvement. Together do they make an (x + y)% improvement? Maybe.

The business principle of kaizen, from the Japanese 改善 for improvement, rests on the assumption that incremental improvements accumulate. But quantifying how improvements accumulate takes some care.

Add or multiply?

Two successive 1% improvements amount to a 2% improvement. But two successive 50% improvements amount to a 125% improvement. So sometimes you can add, and sometimes you cannot. What’s going on?

An x% improvement multiplies something by 1 + x/100. For example, if you earn 5% interest on a principal of P dollars, you now have 1.05 P dollars.

So an x% improvement followed by a y% improvement multiplies by

(1 + x/100)(1 + y/100) = 1 + (x + y)/100 + xy/10000.

If x and y are small, then xy/10000 is negligible. But if x and y are large, the product term may not be negligible, depending on context. I go into this further in this post: Small probabilities add, big ones don’t.
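Here's the calculation as a couple lines of Python, just to make the formula concrete.

```python
def combined_improvement(x, y):
    """Percent improvement from an x% improvement followed by a y% improvement."""
    return ((1 + x/100) * (1 + y/100) - 1) * 100

print(combined_improvement(1, 1))    # 2.01, close to simply adding
print(combined_improvement(50, 50))  # 125.0, far from adding
```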

Interactions

Now let’s look at a variation. Suppose doing one thing by itself brings an x% improvement and doing another thing by itself makes a y% improvement. How much improvement could you expect from doing both?

For example, suppose you find through A/B testing that changing the font on a page increases conversions by 10%. And you find in a separate A/B test that changing an image on the page increases conversions by 15%. If you change the font and the image, would you expect a 25% increase in conversions?

The issue here is not so much whether it is appropriate to add percentages. Since

1.1 × 1.15 = 1.265

you don’t get a much different answer whether you multiply or add. But maybe you change the font and the image and conversions increase by only 12%. Maybe either change alone creates a better impression, but together they make no better an impression than either change by itself. Or maybe the new font and the new image clash somehow, and doing both changes together lowers conversions.

The statistical term for what’s going on is interaction effects. A sequence of small improvements creates an additive effect if the improvements are independent. But the effects could be dependent, in which case the whole is less than the sum of the parts. This is typical. Assuming that improvements are independent is often overly optimistic. But sometimes you run into a synergistic effect and the whole is greater than the sum of the parts.

Sequential testing

In the example above, we imagine testing the effect of a font change and an image change separately. What if we first changed the font, then with the new font tested the image? That’s better. If there were a clash between the new font and the new image we’d know it.

But we’re missing something here. If we had tested the image first and then tested the new font with the new image, we might have gotten different results. In general, the order of sequential testing matters.

Factorial testing

If you have a small number of things to test, you can discover interaction effects by doing a factorial design, either a full factorial design or a fractional factorial design.

If you have a large number of things to test, you’ll have to do some sort of sequential testing. Maybe you do some combination of sequential and factorial testing, guided by which effects you have reason to believe will be approximately independent.

In practice, a testing plan needs to balance simplicity and statistical power. Sequentially testing one option at a time is simple, and may be fine if interaction effects are small. But if interaction effects are large, sequential testing may be leaving money on the table.
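To make the factorial idea concrete, here's a minimal 2 × 2 full factorial sketch in Python, with made-up conversion rates for the four font/image combinations from the example above. The interaction column in the model measures how far the combined change falls short of (or exceeds) the sum of the individual effects.

```python
import numpy as np

# Factors coded 0/1; one (hypothetical) conversion rate per cell of the 2x2 design.
font  = np.array([0, 1, 0, 1])
image = np.array([0, 0, 1, 1])
conv  = np.array([0.020, 0.022, 0.023, 0.0225])  # made-up data for illustration

# Design matrix: intercept, main effects, and an interaction column.
X = np.column_stack([np.ones(4), font, image, font * image])
coef, *_ = np.linalg.lstsq(X, conv, rcond=None)
baseline, font_effect, image_effect, interaction = coef
print(interaction)  # negative here: the whole is less than the sum of the parts
```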

Help with testing

If you’d like some help with testing, or with web analytics more generally, we can help.


Fitting a line without an intercept term

The other day I was looking at how many lumens LED lights put out per watt. I found some data on Wikipedia, and as you might expect the relation between watts and lumens is basically linear, though not quite.

If you were to do a linear regression on the data you’d get a relation

lumens = a × watts + b

where the intercept term b is not zero. But this doesn’t make sense: a light bulb that is turned off doesn’t produce light, and it certainly doesn’t produce negative light. [1]

You may be able to fit the regression and ignore b; it’s probably small. But what if you wanted to require that b = 0? Some regression software will allow you to specify a zero intercept, and some will not. But it’s easy enough to compute the slope a without using any regression software.

Let x be the vector of input data, the wattage of the LED bulbs. And let y be the corresponding light output in lumens. The regression line uses the slope a that minimizes

‖a x − y‖² = a² x · x − 2a x · y + y · y.

Setting the derivative with respect to a to zero shows

a = (x · y) / (x · x)
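Here's the whole calculation in a few lines of Python. The wattage and lumen values below are made up for illustration, not the Wikipedia data.

```python
import numpy as np

# Slope of the no-intercept model y = a x, computed directly from dot products.
watts  = np.array([4.0, 6.0, 9.0, 12.0, 17.0])            # hypothetical bulbs
lumens = np.array([420.0, 600.0, 880.0, 1200.0, 1650.0])

a = np.dot(watts, lumens) / np.dot(watts, watts)
print(a)  # roughly 100 lumens per watt
```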

Now there’s more to regression than just line fitting. A proper regression analysis would look at residuals, confidence intervals, etc. But the calculation above was good enough to conclude that LED lights put out about 100 lumens per watt.

It’s interesting that making the model more realistic, i.e. requiring b = 0, is either a complication or a simplification, depending on your perspective. It complicates using software, but it simplifies the math.


[1] The orange line in the image above is the least squares fit for the model y = ax, but it’s not quite the same line you’d get if you fit the model y = ax + b.

Condition on your data

Suppose you design an experiment, an A/B test of two page designs, randomizing visitors to Design A or Design B. You planned to run the test for 800 visitors and you calculated some confidence level α for your experiment.

You decide to take a peek at the data after only 300 randomizations, even though your statistician warned you in no uncertain terms not to do that. Something about alpha spending.

You can’t unsee what you’ve seen. Now what?

Common sense says it matters what you saw. If 148 people were randomized to Design A, and every single one of them bought your product, while 10 out of the 152 people randomized to Design B bought your product, common sense says you should call Design A the winner and push it into production ASAP.

But what if you saw somewhat better results for Design A? You can have some confidence that Design A is better, though not as much as the nominal confidence level of the full experiment. This is what your (frequentist) statistician was trying to protect you from.

When your statistician designed your experiment, he obviously didn’t know what data you’d see, so he designed a process that would be reliable in a certain sense. When you looked at the data early, you violated the process, and so now your actual practice no longer has the probability of success initially calculated.

But you don’t care about the process; you want to know whether to deploy Design A or Design B. And you saw the data that you saw. Particularly in the case where the results were lopsidedly in favor of Design A, your gut tells you that you know what to do next. You might reasonably say “I get what you’re saying about repeated experiments and all that. (OK, not really, but let’s say I do.) But look what happened! Design A is a runaway success!”

In formal terms, your common sense is telling you to condition on the observed data. If you’ve never studied Bayesian statistics you may not know exactly what that means and how to calculate it, but it’s intuitively what you’ve done. You’re making a decision based on what you actually saw, not on the basis of a hypothetical sequence of experiments you didn’t run and won’t run.

Bayesian statistics does formally what your intuition does informally. This is important because even though your intuition is a good guide in extreme cases, it can go wrong when things are less obvious. As I wrote about recently, smart people can get probability very wrong, even when their intuition is correct in some instances.
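Here's a minimal sketch of what conditioning on the data looks like, assuming uniform Beta(1, 1) priors on the two conversion rates and the lopsided interim data described above.

```python
import numpy as np

rng = np.random.default_rng(20240101)

# Posterior draws for each design's conversion rate: Beta(1 + successes, 1 + failures).
a_draws = rng.beta(1 + 148, 1 + 0, size=100_000)    # 148 of 148 converted
b_draws = rng.beta(1 + 10, 1 + 142, size=100_000)   # 10 of 152 converted

print(np.mean(a_draws > b_draws))  # essentially 1: Design A is almost certainly better
```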

If you’d like help designing experiments or understanding their results, we can help.

Can you look at experimental results along the way or not?


Suppose you’re running an A/B test to determine whether a web page produces more sales with one graphic versus another. You plan to randomly assign image A or B to 1,000 visitors to the page, but after only randomizing 500 visitors you want to look at the data. Is this OK or not?

Of course there’s nothing morally or legally wrong with looking at interim data, but is there anything statistically wrong? That depends on what you do with your observation.

There are basically two statistical camps, Frequentist and Bayesian. (There are others, but these two account for the lion’s share.) Frequentists say no, you should not look at interim data, unless you apply something called alpha spending. Bayesians, on the other hand, say go ahead. Why shouldn’t you look at your data? I remember one Bayesian colleague mocking alpha spending as being embarrassing.

So who is right, the Frequentists or the Bayesians? Both are right, given their respective criteria.

If you want to achieve success, as defined by Frequentists, you have to play by Frequentist rules, alpha spending and all. Suppose you design a hypothesis test to have a confidence level α, and you look at the data midway through an experiment. If the results are conclusive at the midpoint, you stop early. This procedure does not have a confidence level α. You would have to require stronger evidence for early stopping, as specified by alpha spending.

The Bayesian interprets the data differently. This approach says to quantify what is believed before conducting the experiment in the form of a prior probability distribution. Then after each data point comes in, you update your prior distribution using Bayes’ theorem to produce a posterior distribution, which becomes your new prior distribution. That’s it. At every point along the way, this distribution captures all that is known about what you’re testing. Your planned sample size is irrelevant, and the fact that you’ve looked at your data is irrelevant.
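For a conversion rate, the updating is especially simple because the Beta distribution is conjugate to the binomial. Here's a sketch with hypothetical data.

```python
# Sequential Bayesian updating of a conversion rate with a Beta prior.
alpha, beta = 1, 1  # Beta(1, 1) prior, i.e. uniform

observations = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]  # hypothetical: 1 = converted, 0 = did not

for obs in observations:
    alpha += obs        # add a success
    beta += 1 - obs     # or add a failure

print(alpha, beta, alpha / (alpha + beta))  # posterior after 10 visitors; peek whenever you like
```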

Now regarding your A/B test, why are you looking at the data early? If it’s simply out of curiosity, and cannot affect your actions, then it doesn’t matter. But if you act on your observation, you change the Frequentist operating characteristics of your experiment.

Stepping back a little, why are you conducting your test in the first place? If you want to make the correct decision with probability 1 − α in an infinite sequence of experiments, then you need to take into account alpha spending, or else you will lower that probability.

But if you don’t care about a hypothetical infinite sequence of experiments, you may find the Bayesian approach more intuitive. What do you know right at this point in the experiment? It’s all encapsulated in the posterior distribution.

Suppose your experiment size is a substantial portion of your potential visitors. You want to experiment with graphics, yes, but you also want to make money along the way. You’re ultimately concerned with profit, not publishing a scientific paper on your results. Then you could use a Bayesian approach to maximize your expected profit. This leads to things like “bandits,” so called by analogy to slot machines (“one-armed bandits”).
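One common bandit strategy is Thompson sampling. Here's a rough simulation sketch, not necessarily the design you'd deploy, with made-up true conversion rates.

```python
import numpy as np

rng = np.random.default_rng(20240101)

true_rates = [0.020, 0.023]   # unknown in practice; made up here for simulation
wins = np.ones(2)             # Beta(1, 1) priors for each design
losses = np.ones(2)

for _ in range(10_000):
    draws = rng.beta(wins, losses)          # sample a plausible rate for each design
    choice = int(np.argmax(draws))          # show the design whose draw is largest
    converted = rng.random() < true_rates[choice]
    wins[choice] += converted
    losses[choice] += 1 - converted

print(wins + losses - 2)  # most traffic goes to the better design while still exploring
```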

If you want to keep things simple, and if the experiment size is negligible relative to your expected number of visitors, just design your experiment according to Frequentist principles and don’t look at your data early.

But if you have good business reasons to want to look at the data early, not simply to satisfy curiosity, then you should probably interpret the interim data from a Bayesian perspective. As the post “Condition on your data” above explains, the Bayesian approach aligns well with common sense.

I’d recommend taking either a Frequentist approach or a Bayesian approach, but not complicating things by hybrid approaches such as alpha spending or designing Bayesian experiments to have desired Frequentist operating characteristics. The middle ground is more complicated and prone to subtle mistakes, though we can help you navigate this middle ground if you need to.

If you need help designing, conducting, or interpreting experiments, we can help. If you want/need to look at interim results, we can show you how to do it the right way.



A Bayesian approach to proving you’re human

I set up a GitHub account for a new employee this morning and spent a ridiculous amount of time proving that I’m human.

The captcha was to listen to three audio clips at a time and say which one contains bird sounds. This is a really clever test, because humans can tell the difference between real bird sounds and synthesized bird-like sounds. And we’re generally good at recognizing bird sounds even against a background of competing sounds. But some of these were ambiguous, and I had real birds chirping outside my window while I was doing the captcha.

You have to do 20 of these tests, and apparently you have to get all 20 right. I didn’t. So I tried again. On the last test I accidentally clicked the start-over button rather than the submit button. I wasn’t willing to listen to another 20 triples of audio clips, so I switched over to the visual captcha tests.

These kinds of tests could be made less annoying and more secure by using a Bayesian approach.

Suppose someone solves 19 out of 20 puzzles correctly. You require 20 out of 20, so you have them start over. When you do, you’re throwing away information. You require 20 more puzzles, despite the fact that they only missed one. And if a bot had solved say 8 out of 20 puzzles, you’d let it pass if it passes the next 20.

If you wipe your memory after every round of 20 puzzles, and allow unlimited do-overs, then any sufficiently persistent entity will almost certainly pass eventually.

Bayesian statistics reflects common sense. After someone (or something) has correctly solved 19 out of 20 puzzles designed to be hard for machines to solve, your conviction that this entity is human is higher than if they/it had solved 8 out of 20 correctly. You don’t need as much additional evidence in the first case as in the latter to be sufficiently convinced.

Here’s how a Bayesian captcha could work. You start out with some distribution on your probability θ that an entity is human, say a uniform distribution. You present a sequence of puzzles, recalculating your posterior distribution after each puzzle, until the posterior probability that this entity is human crosses some upper threshold, say 0.95, or some lower threshold, say 0.50. If the upper threshold is crossed, you decide the entity is likely human. If the lower threshold is crossed, you decide the entity is likely not human.

If solving 20 out of 20 puzzles correctly crosses your threshold of human detection, then after solving 19 out of 20 correctly your posterior probability of humanity is close to the upper threshold, and only a few more puzzles would be required. And if an entity solved 8 out of 20 puzzles correctly, that may cross your lower threshold. If not, maybe only a few more puzzles would be necessary to reject the entity as non-human.
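Here's a simplified sketch of the updating, tracking a single probability of humanity rather than a full distribution, under assumed accuracy rates: suppose a human solves a puzzle correctly with probability 0.95 and a bot with probability 0.30. Both numbers are made up for illustration.

```python
P_CORRECT_HUMAN = 0.95   # assumed accuracy of a human on one puzzle
P_CORRECT_BOT = 0.30     # assumed accuracy of a bot
UPPER, LOWER = 0.95, 0.50

def update(p_human, correct):
    """Posterior probability of being human after one puzzle, by Bayes' theorem."""
    like_human = P_CORRECT_HUMAN if correct else 1 - P_CORRECT_HUMAN
    like_bot = P_CORRECT_BOT if correct else 1 - P_CORRECT_BOT
    return like_human * p_human / (like_human * p_human + like_bot * (1 - p_human))

p = 0.5  # prior probability that the entity is human
answers = [True, True, True, False, True, True]  # hypothetical puzzle results

for i, correct in enumerate(answers, 1):
    p = update(p, correct)
    if p > UPPER or p < LOWER:
        print(f"stop after puzzle {i}: P(human) = {p:.3f}")
        break
```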

When I worked at MD Anderson Cancer Center we applied this approach to adaptive clinical trials. A clinical trial might stop early because of particularly good results or particularly bad results. Clinical trials are expensive, both in terms of human costs and financial costs. Rejecting poor treatments quickly, and sending promising treatments on to the next stage quickly, is both humane and economical.


Estimating an author’s vocabulary

How would you estimate the size of an author’s vocabulary? Suppose you have analyzed the author’s available works and found n words, x of which are unique. Then you know the author’s vocabulary was at least x, but it’s reasonable to assume that the author may have known words he never used in writing, or at least not in the works you have access to.

Brainerd [1] suggested the following estimator based on a Markov chain model of language. The estimated vocabulary is the number N satisfying the equation

\sum_{j=0}^{x-1}\left(1 - \frac{j}{N}\right)^{-1} = n

The left side is a decreasing function of N, so you could solve the equation by finding values of N that make the sum larger and smaller than n, then using a bisection algorithm.
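Here's a sketch of that bisection in Python. The token and type counts at the bottom are hypothetical.

```python
import numpy as np

def token_count(N, x):
    """Left side of Brainerd's equation: the sum over j = 0, ..., x-1 of 1/(1 - j/N)."""
    j = np.arange(x)
    return np.sum(1.0 / (1.0 - j / N))

def estimate_vocabulary(n, x, tol=1e-6):
    """Solve token_count(N, x) = n for N by bisection, assuming a solution with N >= x exists."""
    lo, hi = float(x), float(x)
    while token_count(hi, x) > n:   # expand the upper bracket; the sum decreases in N
        hi *= 2
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2
        if token_count(mid, x) > n:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical example: 20,000 words of text containing 10,000 distinct words.
print(round(estimate_vocabulary(20_000, 10_000)))  # roughly 12,500
```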

We can see that the model is qualitatively reasonable. If every word is unique, i.e. x = n, then the solution is N = ∞. If you haven’t seen any repetition, then for all you know the author could keep writing new words indefinitely. As the amount of repetition increases, the estimate of N decreases.

Brainerd’s model is simple, but it tends to underestimate vocabulary. More complicated models might do a better job.

Problems analogous to estimating vocabulary size come up in other applications. For example, an ecologist might want to estimate the number of new species left to be found based on the number of species seen so far. In my work in data privacy I occasionally have to estimate diversity in a population based on diversity in a sample. Both of these examples are analogous to estimating potential new words based on the words you’ve seen.

[1] Brainerd, B. On the relation between types and tokens in literary text, J. Appl. Prob. 9, pp. 507-5

Detecting the language of encrypted text

Imagine you are a code breaker living a century ago. You’ve intercepted a message, and you go through your bag of tricks, starting with the simplest techniques first. Maybe the message has been encrypted using a simple substitution cipher, so you start with that.

Simple substitution ciphers can be broken by frequency analysis: the most common letter probably corresponds to E, the next most common letter probably corresponds to T, etc. But that’s only for English prose. Maybe the message was composed in French. Or maybe it was composed in Japanese, then transliterated into the Latin alphabet so it could be transmitted via Morse code. You’d like to know what language the message was written in before you try identifying letters via their frequency.

William Friedman’s idea was to compute a statistic, what he dubbed the index of coincidence, to infer the probable language of the source. Since this statistic only depends on symbol frequencies, it gives the same value whether computed on clear text or text encrypted with simple substitution. It also gives the same value if the text has additionally been run through a transposition cipher.
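The statistic itself is simple to compute: it's the probability that two letters drawn at random from the text are the same. Here's a sketch in Python.

```python
from collections import Counter

def index_of_coincidence(text, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Probability that two randomly chosen letters of the text match.
    Around 0.066 for English prose, about 0.038 for uniformly random letters."""
    letters = [c for c in text.upper() if c in alphabet]
    n = len(letters)
    counts = Counter(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))
```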

(Classical cryptanalysis techniques, such as computing the index of coincidence, are completely useless against modern cryptography. And yet ideas from classical cryptanalysis are still useful for other applications. Here’s an example that came up in a consulting project recently.)

As I mentioned at the top of the post, you’d try breaking the simplest encryption first. If the index of coincidence is lower than you’d expect for a natural language, you might suspect that the message has been encrypted using polyalphabetic substitution. That is, instead of using one substitution alphabet for every letter, maybe the message has been encrypted using a cycle of n different alphabets, such as the Vigenère cipher.

How would you break such a cipher? First, you’d like to know what n is. How would you do that? By trial and error. Try splitting the text into groups of letters according to their position mod n, then compute the index of coincidence again for each group. If the index statistics are much larger when n = 7, you’re probably looking at a message encrypted with a key of length 7.
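In code, the trial and error looks something like the sketch below, which builds on the function above; ciphertext stands in for the intercepted message.

```python
def average_ic_for_key_length(ciphertext, n):
    """Split the text into n groups by position mod n and average their indices of coincidence."""
    letters = [c for c in ciphertext.upper() if c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
    groups = ["".join(letters[i::n]) for i in range(n)]
    return sum(index_of_coincidence(g) for g in groups) / n

# Try candidate key lengths; the correct length should stand out with a larger average.
# for n in range(1, 13):
#     print(n, average_ic_for_key_length(ciphertext, n))
```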

The source language would still leave its signature. If the message was encrypted by cycling through seven scrambled alphabets, each of the seven groups would most likely have the statistical distribution of the language used in the clear text.

Friedman’s index of coincidence, published in 1922, was one statistic that could be computed based on letter frequencies, one that worked well in practice, but you could try other statistics, and presumably people did. The index of coincidence is essentially Rényi entropy with parameter α = 2. You might try different values of α.

If the approach above doesn’t work, you might suspect that the text was not encrypted one letter at a time, even using multiple alphabets. Maybe pairs of letters were encrypted, as in the Playfair cipher. You could test this hypothesis by looking at the frequencies of pairs of letters in the encrypted text, calculating an index of coincidence (or some other statistic) based on pairs of letters.

Here again letter pair frequencies may suggest the original language. It might not distinguish Spanish from Portuguese, for example, but it would distinguish Japanese written in Roman letters from English.

Uncovering names masked with stars

Sometimes I’ll see things like my name partially concealed as J*** C*** and think “a lot of good that does.”

Masking letters reveals more than people realize. For example, when you see that someone’s first name is four letters and begins with J, there’s about a 70% chance they’re male and there’s a 44% chance they’re named John. If you know this person is male, there’s a 63% chance their name is John.

If you know a man’s name has the form J***, his name isn’t necessarily John, though that’s the most likely possibility. There’s an 8% chance his name is Jack and a 6% chance his name is Joel.

All these numbers depend on the data set you’re looking at, but these are roughly accurate numbers for looking at any representative sample of American names.
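Here's a sketch of how you'd compute such probabilities, assuming a hypothetical file first_names.csv with columns name, sex, and count (say, tabulated from Social Security or census data).

```python
import pandas as pd

names = pd.read_csv("first_names.csv")   # hypothetical name frequency table

# All names matching the mask J***: four letters, starting with J.
masked = names[(names["name"].str.len() == 4) & (names["name"].str.startswith("J"))]
total = masked["count"].sum()

print((masked.groupby("name")["count"].sum() / total).sort_values(ascending=False).head())
print(masked.groupby("sex")["count"].sum() / total)
```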

Some names stand out more than others. If I tell you someone’s name is E********, there’s a 90% chance the name is Elizabeth.

If I tell you someone’s name is B*****, there’s a 77% chance this person is female, but it’s harder to guess which name is hers. The most likely possibility is Brenda, but there are several other possibilities that are fairly likely: Bonnie, Brooke, Brandy, etc.

We could go through a similar exercise with last names. You can probably guess who S**** is, though C***** is not so clear.

In short, replacing letters with stars doesn’t do much to conceal someone’s name. It usually doesn’t let you infer someone’s name with certainty, but it definitely improves your chances of guessing correctly. If you have a few good guesses as to someone’s name, and some good guesses on a handful of other attributes, together you have a good chance of identifying someone.
