This post is for someone unfamiliar with SciPy who wants to use Python to do basic statistical calculations. More specifically, this post will look at working with common families of random variables—normal, exponential, gamma, etc.—and how to calculate probabilities with these.
All the statistical functions you need will be in the stats
subpackage of SciPy. You could import everything with
>>> from scipy.stats import *
This will make software developers cringe because it’s good software development practice to only import what you need [1]. But we’re not writing software here; we’re using Python as a calculator.
Distributions
Every distribution in SciPy has a location parameter loc
which defaults to 0, and scale parameter scale
that defaults to 1.
The table below lists additional parameters that some common distributions have.
Distribution | SciPy name | Parameters |
---|---|---|
normal | norm |
|
exponential | expon |
|
beta | beta |
shape1 , shape2 |
binomial | binom |
size , prob |
chi-squared | chi2 |
df |
F | f |
df1 , df2 |
gamma | gamma |
shape |
hypergeometric | hypergeom |
M , n , N |
Poisson | poisson |
lambda |
Student t | t |
df |
When using any statistical software its good to verify that the software is using the parameterization that you expect. For example, the exponential distribution can be parameterized by rate or by scale. SciPy does everything by scale. One way to test the parameterization is to calculate the mean. For example, if the mean of an exp(100) random variable is 100, you’re software is using the scale paraemterization. If the mean is 1/100, it’s using the rate.
Functions
For a random variable X from one of the families above, we’ll show how to compute Prob(X ≤ x), Prob(X ≥ x), and how to solve for x given one of these probabilities. And for discrete distributions, we’ll show how to find Prob(X = p).
CDF: Cumulative distribution function
The probability Prob(X ≤ x) is the CDF of X at x. For example, the probability that a standard normal random variable is less than 2 is
norm.cdf(2)
We didn’t have to specify the location or scale because the standard normal uses default parameters. If X had mean 3 and standard deviation 10 we’d call
norm.cdf(2, loc=3, scale=10)
For another example, suppose we’re working with a gamma distribution with shape 3 and scale 5 and want to know the probability of such a random variable taking on a value less than 2. Then we’d call
gamma.cdf(2, 3, scale=5)
The second argument, 2, is the shape parameter.
CCDF: Complementary cumulative distribution function
The probability Prob(X ≥ x) is the complementary CDF of X at x, which is 1 minus the CDF at x. SciPy uses the name sf
for the CCDF. Here “sf” stands for survival function.
Why have both a CDF and CCDF function? One reason is convenience, but a more important reason is numerical accuracy. If a probability p is very near 1, then 1 – p will be partially or completely inaccurate.
For example, let X be a standard normal random variable. The probability that X is greater than 9 is
norm.sf(9)
which is on the order of 10−19. But if we compute
1 - norm.cdf(9)
we get exactly 0. Floating point numbers have 16 figures of precision, and we need 19 to represent the difference between our probability and 1. More on why mathematical libraries have functions that seem unnecessary at first here.
Inverse CDF
SciPy calls the inverse CDF function ppf
for percentile point function.
For example, suppose X is a gamma random variable with shape 3 and scale 5, and we want to solve for the value of x such that Prob(X ≤ x) is 0.7. We could do this with
gamma.ppf(0.7, 3, scale=5)
Inverse CCDF
SciPy calls the inverse CCDF function isf
for inverse survival function. So, for example,
gamma.isf(0.3, 3, scale=5)
should return the same value as the example above because the point where the right tail has probability 0.3 is the same as the point where the left tail has probability 0.7.
Probability mass function
For a discrete random variable X we can compute the probability mass function, the probability that X takes on a value x. Unsurprisingly SciPy calls this function pmf
.
For example, the probability of seeing 6 heads out of 10 flips of a fair coin is computed by
binom.pmf(6, 10, 0.5)
Getting help
To find out more about a probability distribution, call help
with that distribution as an argument. For example, you could call
help(hypergeom)
to read more about the hypergeometric distribution.
For much more information, including topics not addressed here, see the documentation for scipy.stats.
***
[1] The reason to not import more than you need is to reduce namespace confusion. If you see a function foo
, is that the foo
function from package A or from package B. The most common way this comes up in statistics is confusion over the gamma function and the gamma distribution. In SciPy, the gamma function is in scipy.special
while the gamma distribution is in scipy.stats
. If you’ve imported everything from scipy.stats
then gamma
means scipy.stats.gamma
. You could bring the gamma function into your environment by importing it with another name. For example
from scipy.special import gamma as gammaf
would import the gamma function giving it the name gammaf
.
Nice summary. Thanks!