The Kullback-Leibler divergence between two probability distributions is a measure of how different the two distributions are. It is sometimes called a distance, but it’s not a distance in the usual sense because it’s not symmetric. At first this asymmetry may seem like a bug, but it’s a feature. We’ll explain why it’s useful to measure the difference between two probability distributions in an asymmetric way.
The Kullback-Leibler divergence between two random variables X and Y is defined as
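KL(X || Y) = -∫ f_X(x) log( f_Y(x) / f_X(x) ) dx = ∫ f_X(x) log( f_X(x) / f_Y(x) ) dx

where f_X and f_Y are the probability density functions of X and Y.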
This is pronounced/interpreted several ways:
- The divergence from Y to X
- The relative entropy of X with respect to Y
- How well Y approximates X
- The information gain going from the prior Y to the posterior X
- The average surprise in seeing Y when you expected X
A theorem of Gibbs proves that K-L divergence is non-negative. It’s clearly zero if X and Y have the same distribution.
The K-L divergence of two random variables is an expected value, and so it matters which distribution you’re taking the expectation with respect to. That’s why it’s asymmetric.
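Writing both directions as expectations makes this explicit:

KL(X || Y) = E_X[ log( f_X / f_Y ) ]   while   KL(Y || X) = E_Y[ log( f_Y / f_X ) ]

The first averages the log density ratio over X's distribution, the second over Y's.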
As an example, consider the probability densities below, one exponential and one gamma with a shape parameter of 2.
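A minimal sketch to draw the two densities (the plotting range here is my choice):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon, gamma

# Exponential density and gamma density with shape parameter 2
x = np.linspace(0, 8, 400)
plt.plot(x, expon.pdf(x), label="exponential")
plt.plot(x, gamma(a=2).pdf(x), label="gamma, shape 2")
plt.xlabel("x")
plt.ylabel("density")
plt.legend()
plt.show()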
The two densities differ mostly on the left end. The exponential distribution believes this region is likely while the gamma does not. This means that an expectation with respect to the exponential distribution will weigh things in this region more heavily. In an information-theoretic sense, an exponential is a better approximation to a gamma than the other way around.
Here’s some Python code to compute the divergences.
from scipy.integrate import quad
from scipy.stats import expon, gamma
from numpy import inf   # scipy.inf has been removed in recent SciPy versions

def KL(X, Y):
    # Integrand of -∫ f_X(x) log( f_Y(x) / f_X(x) ) dx
    f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
    return quad(f, 0, inf)

e = expon
g = gamma(a = 2)

print( KL(e, g) )
print( KL(g, e) )
This returns
(0.5772156649008394, 1.3799968612282498e-08)
(0.4227843350984687, 2.7366807708872898e-09)
The first element of each pair is the integral and the second is the error estimate. So apparently both integrals have been computed accurately, and the first is clearly larger. This backs up our expectation that it’s more surprising to see a gamma when expecting an exponential than vice versa.
Although K-L divergence is asymmetric in general, it can be symmetric. For example, suppose X and Y are normal random variables with the same variance but different means. Then it would be equally surprising to see either one when expecting the other. You can verify this in the code above by changing the KL function to integrate over the whole real line
def KL(X, Y):
    f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
    return quad(f, -inf, inf)
and trying an example.
from scipy.stats import norm

n1 = norm(1, 1)
n2 = norm(2, 1)

print( KL(n1, n2) )
print( KL(n2, n1) )
This returns
(0.4999999999999981, 1.2012834963423225e-08)
(0.5000000000000001, 8.106890774205374e-09)
and so both integrals are equal to within the error in the numerical integration.
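This matches the closed form: for normal distributions with a common variance σ² and means μ1 and μ2,

KL( N(μ1, σ²) || N(μ2, σ²) ) = (μ1 - μ2)² / (2σ²)

which is symmetric in the two means and equals 1/2 for the example above.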
> The surprise in seeing Y when you expected X
In expectation, though. It is an interpretation of a distribution, not of single samples.
Justin: Good point. I edited the post to insert “average.” I started to say “expected”, but then the same word would have two meanings in the same sentence. :)
yeah i see that. the idea of expected surprise always got me chuckling.
Hi John,
Could you please clarify how we can conclude that the exponential approximates the gamma better? I didn’t follow the argument with the plot.
Thanks,
Also worth mentioning that KL divergence does not satisfy the triangle inequality. Easy to compute by hand with exponential RVs with different parameters.
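As a quick numerical check of that comment (the rates 1, 2, and 3 below are just an illustrative choice, not from the original discussion), the same integral as in the post shows the triangle inequality failing:

from scipy.integrate import quad
from scipy.stats import expon
from numpy import inf

def KL(X, Y):
    f = lambda x: -X.pdf(x)*(Y.logpdf(x) - X.logpdf(x))
    return quad(f, 0, inf)[0]   # keep only the value of the integral

# Exponential distributions with rates 1, 2, and 3
# (scipy parameterizes by scale = 1/rate).
p = expon(scale=1)
q = expon(scale=1/2)
r = expon(scale=1/3)

# A triangle inequality would require KL(p, r) <= KL(p, q) + KL(q, r),
# but the left side is about 0.90 and the right side about 0.40.
print( KL(p, r), KL(p, q) + KL(q, r) )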
Remark 1: There is a natural interpretation which I like: if P and Q are distributions over an alphabet of symbols, then

KL(P||Q) is the number of bits per symbol _lost_ when compressing a stream of symbols with actual probability distribution P using optimal compression for distribution Q.
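As a small illustration of that interpretation (a toy three-symbol alphabet of my own choosing): with logs taken base 2, the expected code length under a code optimal for Q exceeds the entropy of P by exactly KL(P||Q) bits per symbol.

import numpy as np

# True symbol distribution P, modeled (and coded) as if it were Q.
P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.25, 0.25, 0.5])

entropy_P     = -np.sum(P * np.log2(P))      # optimal bits/symbol for P
cross_entropy = -np.sum(P * np.log2(Q))      # bits/symbol using a code optimal for Q
kl_bits       =  np.sum(P * np.log2(P / Q))  # KL(P || Q) in bits

# The compression loss equals the divergence: both print 0.25 here.
print( cross_entropy - entropy_P, kl_bits )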
Remark 2: the symmetrised version, the Jensen-Shannon divergence

JS(P,Q) = 1/2 KL(P || 1/2 P + 1/2 Q) + 1/2 KL(Q || 1/2 P + 1/2 Q)

is the number of bits per symbol lost when mixing two streams with symbol distributions P and Q and compressing the mixture optimally, rather than compressing each stream individually.
It is stated in the literature that JS is the _square_ of a metric
https://link.springer.com/article/10.1007%2FBF02517812
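Sketching that remark with the same sort of toy distributions (again my own example), JS comes out symmetric, and its square root is the quantity claimed to be a metric:

import numpy as np

def kl_bits(P, Q):
    return np.sum(P * np.log2(P / Q))

def js(P, Q):
    M = 0.5 * (P + Q)   # the 50/50 mixture of the two streams
    return 0.5 * kl_bits(P, M) + 0.5 * kl_bits(Q, M)

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.25, 0.25, 0.5])

# JS is symmetric in its arguments; sqrt(JS) is the Jensen-Shannon distance.
print( js(P, Q), js(Q, P), np.sqrt(js(P, Q)) )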