Non-linear Dependence? Mutual Information to the Rescue!

We are all familiar with the idea of a correlation. In the broadest sense of the word, a correlation can refer to any kind of dependence between two variables. There are three widely used tests for correlation:

  • Pearson’s r: Used to measure the strength of a linear relationship between two variables. Assumes linear dependence and that each variable is normally distributed.
  • Spearman’s ρ: Used to measure rank correlation. Requires the dependence structure to be described by a monotonic relationship.
  • Kendall’s 𝛕: Used to measure the ordinal association between two variables.

While these three measures give us plenty of options to work with, they do not work in all cases. Take, for example, two variables Y1 and Y2 that vary in a concerted manner.
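
The original data are not reproduced here, but to make the snippets below runnable we can generate a stand-in: two discrete “state” variables linked by a noisy, non-monotonic mapping. This is purely hypothetical illustration data, so the numbers it produces will differ from the outputs quoted in this post.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for Y1 and Y2: Y1 hops between four discrete states,
# and Y2 usually follows a fixed but non-monotonic mapping of Y1 (with some
# noise), so the two tend to change state together without a monotonic trend.
n = 50
state_map = np.array([2, 0, 3, 1])          # non-monotonic mapping of states
Y1 = rng.integers(0, 4, size=n)
Y2 = np.where(rng.random(n) < 0.8, state_map[Y1], rng.integers(0, 4, size=n))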

Perhaps we suspect that a state change in Y1 leads to a state change in Y2 or vice versa and we want to measure the association between these variables. Using the three measures of correlation, we get the following results:

from scipy.stats import pearsonr
from scipy.stats import kendalltau
from scipy.stats import spearmanr

print("Pearson's r:   ", round(pearsonr(Y1, Y2)[0],3), 
      "P:", round(pearsonr(Y1, Y2)[1], 3))
print("Kendall's tau: ", round(kendalltau(Y1, Y2)[0],3), 
      "P:", round(kendalltau(Y1, Y2)[1], 3))
print("Spearman's rho:", round(spearmanr(Y1, Y2)[0],3), 
      "P:", round(spearmanr(Y1, Y2)[1], 3))
Pearson's r:    0.156 P: 0.447
Kendall's tau:  0.051 P: 0.758
Spearman's rho: 0.261 P: 0.198

None of the three measures picks up on the relationship between Y1 and Y2, though this is not surprising: the patterns in the data are inconsistent. Sometimes the two variables increase together, whereas at other times an increase in one is accompanied by a decrease in the other. None of these measures is built for that kind of relationship, but what if we still want to quantify it? Enter mutual information.

Mutual information (MI) estimates how much information we gain about one variable by observing the other. It does not assume any particular dependence structure, which makes it very useful for cases like this one. MI is closely linked to the concept of entropy and is defined as

I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X | Y)

where I(X; Y) denotes the MI between the random variables X and Y, H(X) is the Shannon entropy of X, H(X, Y) is their joint entropy, and H(X | Y) is the conditional entropy of X given Y. In other words, MI is the sum of the entropies of the two variables minus their joint entropy, or equivalently, the entropy of X minus the conditional entropy of X given Y.
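
As a quick sanity check on this identity, here is a minimal sketch using a made-up 2×2 joint probability table (an assumption for illustration, not the post’s data) that computes the entropies and confirms that both forms of the definition agree:

import numpy as np

# Hypothetical joint probability table p(x, y) for two binary variables
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])
p_x = p_xy.sum(axis=1)                       # marginal p(x)
p_y = p_xy.sum(axis=0)                       # marginal p(y)

def H(p):
    # Shannon entropy (in nats) of a probability vector or table
    p = p[p > 0]
    return -np.sum(p * np.log(p))

H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))   # H(X|Y), computed directly

print(H(p_x) + H(p_y) - H(p_xy))             # I(X;Y) from the joint entropy
print(H(p_x) - H_x_given_y)                  # I(X;Y) from the conditional entropy

Both prints give the same value, illustrating that the two expressions above are equivalent.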

To calculate MI, we need to find the probability mass/density functions of X and Y, as well as their joint mass/density function. For discrete variables,

I(X; Y) = Σ_{y ∈ Y} Σ_{x ∈ X} p(x, y) log( p(x, y) / (p(x) p(y)) )

and for continuous variables,

I(X; Y) = ∫_Y ∫_X p(x, y) log( p(x, y) / (p(x) p(y)) ) dx dy
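
In the discrete case, the double sum can be evaluated directly from a joint probability table. Here is a minimal sketch, again using a made-up 2×2 joint distribution rather than the post’s data:

import numpy as np

# Hypothetical joint probability table p(x, y) for two binary variables
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Plug-in evaluation of the double sum (in nats); any cell with p(x, y) = 0
# contributes nothing and is skipped
mi = 0.0
for i in range(p_xy.shape[0]):
    for j in range(p_xy.shape[1]):
        if p_xy[i, j] > 0:
            mi += p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))

print(mi)

For this table the result matches the entropy-based expression from the previous sketch, since the two formulations are equivalent.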

Before we put this to the test on our data, a word on interpreting MI. While the correlation measures we looked at earlier are bounded between -1 and 1, MI has no fixed upper bound. The lower bound of MI is 0, which occurs when X and Y are independent, whereas the upper bound is given by

I(X; Y) ≤ min( H(X), H(Y) )
since we can only have as much MI as is contained in the variable with the lowest entropy. Now, putting this all into practice using the handy implementation in scikit-learn, we get

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from scipy.stats import entropy

def get_entropy(labels, base=None):
    # Shannon entropy of a discrete variable, estimated from label counts
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

# MI between Y1 (as a single feature column) and the discrete target Y2
mi = mutual_info_classif(Y1.reshape(-1, 1), Y2).item()
# MI can be at most the entropy of the less informative variable
upper_bound = min([get_entropy(Y1), get_entropy(Y2)])

print("Mutual Information :", round(mi,3))
print("Minimum entropy    :", round(upper_bound,3))
print("Rescaled MI        :", round(mi/upper_bound,3))
Mutual Information : 2.357
Minimum entropy    : 2.859
Rescaled MI        : 0.824

To account for the entropies of Y1 and Y2, I have rescaled the MI by the smaller of the two entropies so that it lies between 0 and 1, which makes interpretation a lot easier. This gives a value of 0.824, roughly meaning that about 82% of the uncertainty contained in the lower-entropy variable is removed by observing the other. MI is clearly able to capture the dependence between Y1 and Y2 much better than any of the correlation measures we tried.

Lastly, a caveat. To calculate MI, we use the probability mass/density functions of the variables. Estimating these is no trivial task, and we should make sure that our data are representative of the underlying marginal distributions. Scikit-learn estimates them using k-nearest-neighbour distances (this is beyond the scope of this post, but the papers outlining the estimation approach are referenced in the documentation). Since there is randomness involved in this estimator, we might want to repeat the calculation several times to obtain a range of MI estimates rather than relying on a single value:

import tqdm

# The upper bound only needs to be computed once
upper_bound = min([get_entropy(Y1), get_entropy(Y2)])

# Repeat the estimate to account for the randomness in the KNN-based estimator
MIs = np.zeros(10000)
for i in tqdm.tqdm(range(10000)):
    MIs[i] = mutual_info_classif(Y1.reshape(-1, 1), Y2).item() / upper_bound

print("\n\nRescaled MI:", round(np.mean(MIs)+np.std(MIs), 3), "-", round(np.mean(MIs)-np.std(MIs), 3))
100%|██████████| 10000/10000 [00:37<00:00, 268.41it/s]

Rescaled MI: 0.842 - 0.712
