Publish AI, ML & data-science insights to a global community of data professionals.

A Low Variance Sample Estimator for Mutual Information

Competing with Correlation

Photo by JESHOOTS.COM on Unsplash
Photo by JESHOOTS.COM on Unsplash

This is a follow-on from my article on approximating KL divergence and follows the same ideas. I figured this deserves its own piece though because I’ve not really seen this specific idea anywhere else on the internet.

What is Mutual Information?

First a quick recap, mutual information describes how dependent two random variables are on one another in a single number. If the mutual information is 0, the variables are independent, otherwise there is some dependence. You might be thinking this sounds a lot like correlation, but the key difference is that correlation measures linear dependence, whereas mutual information also accounts for nonlinear dependence. For example, given the random variable X suppose that Y=X². Clearly X and Y are related, but correlation likely wouldn’t pick up on it since the relationship is nonlinear while the mutual information would.

So how exactly does it work? Well, we know from probability theory that two random variables are independent (and so also uncorrelated) if the product of their individual probability density functions equals the joint probability density function:

Image by author: the case for X and Y independent.
Image by author: the case for X and Y independent.

So to measure independence, we really need to find a way of measuring the similarity between the joint probability density function and the product of the individual density functions. Luckily, we have the tool for this from information theory, KL divergence! For brevity’s sake, I won’t go into a deep dive on KL divergence here, but if you want to know more, do see my article on the topic here. The full mutual information is thus defined below:

Image by author
Image by author

Why Estimate in the First Place?

  1. No analytical solution: The full form analytical solution for the mutual information may not be known. For example, this is the case for Gaussian Mixture distributions.
  2. High computational complexity: Calculating the full mutual information often requires summing over the whole distribution space. Using an approximation that doesn’t need to do this is useful as it could be faster.

Criteria for Estimators

Intuitively, an estimator should have similar behaviour to the original metric being estimated. We can measure this similarity in two ways:

  1. Bias: Ideally, the estimator should be unbiased; that is, the expected value of the estimator should be equal to the original metric.
  2. Variance: An unbiased estimator with 0 variance (deterministic) would be exactly equal to the original metric! Of course, this is unrealistic, but ideally the variance should be as low as possible, increasing the likelihood of getting values closer to the original metric.

Estimating Mutual Information

First, we need to work out from which distribution we are getting our samples from in our estimator. Suppose we have the random variables X and Y, at each timestep we will simultaneously sample an x and y. Therefore, the samples are generated from the joint distribution. This is important in dealing with the first criteria for estimators, being unbiased, in knowing which distribution to calculate the expectation over. Let’s start with the standard solution currently used:

Image by author
Image by author

This is good, but it has a high variance since it can take negative values whereas the actual mutual information cannot:

Image by author: when the ratio f_{XY}(x,y)/f_X(x)f_Y(y) is less than 1, the sample mutual information estimator given above is negative 😞
Image by author: when the ratio f_{XY}(x,y)/f_X(x)f_Y(y) is less than 1, the sample mutual information estimator given above is negative 😞

One way to improve this is to add a term that has an expected value of 0 and is negatively correlated with the original approximation above. We propose a solution below:

Image by author
Image by author

This works assuming f_X(x) and f_Y(y) are valid probability functions (masses sum to 1). We can then update our estimator to be:

Image by author
Image by author

Where we can solve for λ to find the minimum variance. Unfortunately, this is case dependent and difficult to calculate analytically. However, we can still find a good compromise by choosing a value of 1. This leads to a semi-positive definite approximate in any case! If we take the ratio f_X(x)fY(y)/f{XY}(x,y) as its own variable and plot it against the approximation value with λ=1, we get the positive definite graph below:

Image by author
Image by author

This yields the final estimator:

Image by author
Image by author

Next time you’re finding yourself needing to use a sample approximation for mutual information, keep this technique in mind!

If you found this article useful, do consider:

Promotions out the way, I do hope you found this article interesting and let me know your thoughts!!


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles