Following on from my previous blopig post, Garrett gave the very helpful suggestion of combining Multiple Comparisons Similarity (MCSim) plots to reduce information redundancy. For example, this an MCSim plot from my previous blog post:

This plot shows effect sizes from a statistical test (specifically Tukey HSD) between mean absolute error (MAE) scores for different molecular featurization methods on a benchmark dataset. Red shows that the method on the y-axis has a greater average MAE score than the method on the x-axis; blue shows the inverse. There is redundancy in this plot, as the same information is displayed in both the upper and lower triangles. Instead, we could plot both the effect size and the p-values from a test on the same MCSim.
First, let’s import the necessary python modules for an example:
import lightgbm as lgbm import matplotlib.pyplot as plt from matplotlib.colors import ListedColormap import numpy as np from scipy.stats import tukey_hsd import seaborn as sns from sklearn.datasets import make_regression from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor from sklearn.model_selection import GroupKFold from sklearn.metrics import mean_absolute_error from tqdm import tqdm
Now, let’s make some dummy data using make_regression from sklearn:
X, y = make_regression(n_samples=1000, n_features=50, noise=0.1)
Sticking with the themes from my previous blopig post, let’s presume this is molecular data and we’ve grouped the molecules based on similarity with Butina:
groups = np.random.randint(0, 10, 1000)
Using repeated GroupKFold to generate many training and testing splits, let’s train and test three different model types on this data: light gradient boosted machines (LGBM), random forest (RF), and gradient boosting (GB):
# Set the number of repeats and folds for train and test splits
repeats = 10
folds = 5
# Create arrays to store the scores
lgbm_scores = np.zeros((repeats*folds))
rf_scores = np.zeros((repeats*folds))
gb_scores = np.zeros((repeats*folds))
# Loop through the repeats
count = 0
pbar = tqdm(total=repeats*folds)
for i in range(repeats):
# Create a GroupKFold object with new random state
splitter = GroupKFold(n_splits=folds, random_state=i, shuffle=True)
# Loop through the folds
for train_index, test_index in splitter.split(X, y, groups=groups):
# Split the data into training and test sets
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# Fit and score LGBM model
lgbm_model = lgbm.LGBMRegressor(verbose=-1)
lgbm_model.fit(X_train, y_train)
preds = lgbm_model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
lgbm_scores[count] = mae
# Fit and score RF model
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)
preds = rf_model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
rf_scores[count] = mae
# Fit and score GB model
gb_model = GradientBoostingRegressor()
gb_model.fit(X_train, y_train)
preds = gb_model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
gb_scores[count] = mae
count += 1
pbar.update(1)
pbar.close()
[out]: 100%|██████████| 50/50 [02:11<00:00, 2.63s/it]
We now have an array of test set MAE scores for each of the three models over the 50 train-test splits. We can now compare these scores using the Tukey HSD test to evaluate whether there is a significant difference between the average scores of the models.
# pairwise comparison of models tukey_results = tukey_hsd(lgbm_scores, rf_scores, gb_scores) print(tukey_results)
[out]: Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval) Comparison Statistic p-value Lower CI Upper CI (0 - 1) -20.255 0.000 -21.527 -18.983 (0 - 2) -0.379 0.760 -1.651 0.893 (1 - 0) 20.255 0.000 18.983 21.527 (1 - 2) 19.876 0.000 18.604 21.148 (2 - 0) 0.379 0.760 -0.893 1.651 (2 - 1) -19.876 0.000 -21.148 -18.604
There we go, we have performed a pairwise statistical test, producing effect sizes and p-values. We can visualise these values in an MCSim plot, like the one above. However, this time we will plot both the effect sizes and the p-values onto a single plot:
# Create a heatmap of the effect size and p-values
# First, create a figure and axis
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
# Set the headings for the heatmap
headings = ['LGBM', 'RF', 'GB']
# Get the effect sizes to plot
effect_size = tukey_results.statistic.copy()
# Create a mask to hide the lower triangle
mask = np.zeros_like(effect_size).astype(bool)
mask[np.tril_indices_from(mask)] = True
# Plot the effect size heatmap
sns.heatmap(
effect_size, # the effect size matrix
annot=True, fmt='.2f', # show the values in the cells
cmap='coolwarm', # use the coolwarm colormap
ax=ax, # use the axis we created
mask=mask, # only show the upper triangle
vmax=25, vmin=-25, # set the limits of the color scale
cbar_kws={
'label': 'Effect Size', 'orientation': 'vertical',
'location': 'right', 'aspect':40, 'pad': 0.1,
'fraction': 0.023, 'anchor': (0.0, 1.0)
}, # set the colorbar parameters
)
# adjust the colorbar ticks
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([i for i in range(-25, 26, 5)])
# Get the p-values to plot
pvals = tukey_results.pvalue.copy()
# Convert the p-values to discrete values for plotting
pvals[pvals > 0.05] = 0
pvals[(pvals > 0.01) & (pvals <= 0.05)] = 1
pvals[(pvals > 0.001) & (pvals <= 0.01)] = 2
pvals[(pvals > 0.) & (pvals <= 0.001)] = 3
# Create a mask to only show the lower triangle
mask = np.zeros_like(pvals).astype(bool)
mask[np.triu_indices_from(mask)] = True
# Create a custom colormap for the p-values
colors = ["#d3d3d3", "#90ee90", "#008000", "#004d00"] # Custom colors for 0, 1, 2, 3
cmap = ListedColormap(colors)
# Plot the p-values on the same axis
sns.heatmap(
pvals, # the p-value matrix
ax=ax, # plot on the same axis as the effect size
mask=mask, # mask the upper triangle
cmap=cmap, # use the custom colormap
vmin=-0.5, vmax=3.5, # set the limits of the color scale
cbar_kws={
'label': 'P-values', 'orientation': 'horizontal',
'location': 'bottom', 'aspect':40, 'pad': 0.1,
}, # set the colorbar parameters
xticklabels=headings, yticklabels=headings, # set the labels for the axes
)
colorbar = ax.collections[1].colorbar
# adjust the colorbar ticks
colorbar.set_ticks([0, 1, 2, 3,])
# adjust the colorbar tick labels
colorbar.set_ticklabels([
'$5\cdot 10^{-2} < p$',
'$10^{-2} < p < 5\cdot 10^{-2}$',
'$10^{-3} < p < 10^{-2}$',
'$p < 10^{-3}$'
])
# Place ticks on all sides of the colorbar
ax.tick_params(top=True, labeltop=True, right=True, labelright=True, )
fig.tight_layout(rect=[0, 0, .9, 1])
plt.show()

We now have a plot where the lower triangle displays the p-values from the statistical test and the upper triangle displays the effect size from the statistical test, reducing information redundancy. For effect size, red indicates that the method on the y-axis has a higher MAE than the method on the x-axis; blue indicates the inverse. For p-values, green indicates a significant difference (p < 0.05).
