{"id":12390,"date":"2025-03-11T16:26:37","date_gmt":"2025-03-11T16:26:37","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=12390"},"modified":"2025-03-11T16:33:02","modified_gmt":"2025-03-11T16:33:02","slug":"combining-multiple-comparisons-similarity-plots-for-statistical-tests","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/03\/combining-multiple-comparisons-similarity-plots-for-statistical-tests\/","title":{"rendered":"Combining Multiple Comparisons Similarity plots for statistical tests"},"content":{"rendered":"\n<div class=\"wp-block-jetpack-markdown\"><p>Following on from my previous <a href=\"https:\/\/www.blopig.com\/blog\/2024\/12\/visualising-and-validating-differences-between-machine-learning-models-on-small-benchmark-datasets\">blopig post<\/a>, <a href=\"https:\/\/www.blopig.com\/blog\/author\/garrett\/\">Garrett<\/a> gave the very helpful suggestion of combining Multiple Comparisons Similarity (MCSim) plots to reduce information redundancy. For example, this an MCSim plot from my previous blog post:<\/p>\n<\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/mcsim.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"600\" height=\"500\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/mcsim.png?resize=600%2C500&#038;ssl=1\" alt=\"\" class=\"wp-image-12068\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/mcsim.png?w=600&amp;ssl=1 600w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/mcsim.png?resize=300%2C250&amp;ssl=1 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><\/figure>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>This plot shows effect sizes from a statistical test (specifically Tukey HSD) between mean absolute error (MAE) scores for different molecular featurization methods on a benchmark dataset. Red shows that the method on the y-axis has a greater average MAE score than the method on the x-axis; blue shows the inverse. There is redundancy in this plot, as the same information is displayed in both the upper and lower triangles. 
Instead, we could plot both the effect size and the p-values from a test on the same MCSim.

First, let's import the necessary Python modules for an example:

```python
import lightgbm as lgbm
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np

from scipy.stats import tukey_hsd
import seaborn as sns

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_absolute_error

from tqdm import tqdm
```

Now, let's make some dummy data using `make_regression` from sklearn:

```python
X, y = make_regression(n_samples=1000, n_features=50, noise=0.1)
```

Sticking with the themes from my previous blopig post, let's presume this is molecular data and we've grouped the molecules based on similarity with Butina:

```python
groups = np.random.randint(0, 10, 1000)
```
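Here the group labels are random so the example stays self-contained. In a real workflow they would come from an actual clustering step; a minimal sketch of what that might look like with RDKit's Butina implementation, assuming a list of RDKit molecules `mols` (which is not part of the dummy-data example above), is:

```python
# Sketch only: derive `groups` from Butina clustering of Morgan fingerprints.
# Assumes `mols` is a list of RDKit Mol objects, not defined in this post.
from rdkit import DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048) for m in mols]

# Butina expects the flattened lower triangle of the distance matrix
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1 - s for s in sims)

# Cluster at a Tanimoto distance cutoff of 0.4 (an arbitrary choice here)
clusters = Butina.ClusterData(dists, len(fps), 0.4, isDistData=True)

# Flatten the cluster tuples into one group label per molecule
groups = np.zeros(len(fps), dtype=int)
for group_id, cluster in enumerate(clusters):
    for idx in cluster:
        groups[idx] = group_id
```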
Using repeated GroupKFold to generate many training and testing splits, let's train and test three different model types on this data: light gradient boosted machines (LGBM), random forest (RF), and gradient boosting (GB):

```python
# Set the number of repeats and folds for train and test splits
repeats = 10
folds = 5

# Create arrays to store the scores
lgbm_scores = np.zeros(repeats * folds)
rf_scores = np.zeros(repeats * folds)
gb_scores = np.zeros(repeats * folds)

# Loop through the repeats
count = 0
pbar = tqdm(total=repeats * folds)
for i in range(repeats):
    # Create a GroupKFold object with a new random state
    splitter = GroupKFold(n_splits=folds, random_state=i, shuffle=True)
    # Loop through the folds
    for train_index, test_index in splitter.split(X, y, groups=groups):
        # Split the data into training and test sets
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit and score LGBM model
        lgbm_model = lgbm.LGBMRegressor(verbose=-1)
        lgbm_model.fit(X_train, y_train)
        preds = lgbm_model.predict(X_test)
        lgbm_scores[count] = mean_absolute_error(y_test, preds)

        # Fit and score RF model
        rf_model = RandomForestRegressor()
        rf_model.fit(X_train, y_train)
        preds = rf_model.predict(X_test)
        rf_scores[count] = mean_absolute_error(y_test, preds)

        # Fit and score GB model
        gb_model = GradientBoostingRegressor()
        gb_model.fit(X_train, y_train)
        preds = gb_model.predict(X_test)
        gb_scores[count] = mean_absolute_error(y_test, preds)

        count += 1
        pbar.update(1)
pbar.close()
```

```
[out]: 100%|██████████| 50/50 [02:11<00:00,  2.63s/it]
```

We now have an array of test set MAE scores for each of the three models over the 50 train-test splits. We can now compare these scores using the Tukey HSD test to evaluate whether there is a significant difference between the average scores of the models.

```python
# Pairwise comparison of models
tukey_results = tukey_hsd(lgbm_scores, rf_scores, gb_scores)
print(tukey_results)
```

```
[out]: Tukey's HSD Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)    -20.255     0.000   -21.527   -18.983
 (0 - 2)     -0.379     0.760    -1.651     0.893
 (1 - 0)     20.255     0.000    18.983    21.527
 (1 - 2)     19.876     0.000    18.604    21.148
 (2 - 0)      0.379     0.760    -0.893     1.651
 (2 - 1)    -19.876     0.000   -21.148   -18.604
```
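As a brief aside, everything in that printed summary is also available programmatically on the scipy result object, which is what we will lean on for plotting. A small illustrative sketch (not needed for the plot below):

```python
# The matrices used for plotting: entry [i, j] compares group i with group j
print(tukey_results.statistic.shape)  # (3, 3) pairwise differences in means
print(tukey_results.pvalue.shape)     # (3, 3) pairwise p-values

# The confidence intervals from the printed summary
ci = tukey_results.confidence_interval(confidence_level=0.95)
print(ci.low)   # (3, 3) lower bounds
print(ci.high)  # (3, 3) upper bounds
```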
There we go, we have performed a pairwise statistical test, producing effect sizes and p-values. We can visualise these values in an MCSim plot, like the one above. However, this time we will plot both the effect sizes and the p-values onto a single plot:

```python
# Create a heatmap of the effect size and p-values
# First, create a figure and axis
fig, ax = plt.subplots(1, 1, figsize=(8, 8))

# Set the headings for the heatmap
headings = ['LGBM', 'RF', 'GB']

# Get the effect sizes to plot
effect_size = tukey_results.statistic.copy()

# Create a mask to hide the lower triangle
mask = np.zeros_like(effect_size).astype(bool)
mask[np.tril_indices_from(mask)] = True

# Plot the effect size heatmap
sns.heatmap(
    effect_size, # the effect size matrix
    annot=True, fmt='.2f', # show the values in the cells
    cmap='coolwarm', # use the coolwarm colormap
    ax=ax, # use the axis we created
    mask=mask, # only show the upper triangle
    vmax=25, vmin=-25, # set the limits of the color scale
    cbar_kws={
        'label': 'Effect Size', 'orientation': 'vertical',
        'location': 'right', 'aspect': 40, 'pad': 0.1,
        'fraction': 0.023, 'anchor': (0.0, 1.0)
    }, # set the colorbar parameters
)

# Adjust the colorbar ticks
colorbar = ax.collections[0].colorbar
colorbar.set_ticks([i for i in range(-25, 26, 5)])

# Get the p-values to plot
raw_pvals = tukey_results.pvalue

# Convert the p-values to discrete bins for plotting; stricter conditions
# overwrite looser ones, so a p-value of exactly 0 still lands in the most
# significant bin
pvals = np.zeros_like(raw_pvals)
pvals[raw_pvals <= 0.05] = 1
pvals[raw_pvals <= 0.01] = 2
pvals[raw_pvals <= 0.001] = 3

# Create a mask to only show the lower triangle
mask = np.zeros_like(pvals).astype(bool)
mask[np.triu_indices_from(mask)] = True

# Create a custom colormap for the p-values
colors = ["#d3d3d3", "#90ee90", "#008000", "#004d00"]  # Custom colors for 0, 1, 2, 3
cmap = ListedColormap(colors)

# Plot the p-values on the same axis
sns.heatmap(
    pvals, # the p-value matrix
    ax=ax, # plot on the same axis as the effect size
    mask=mask, # mask the upper triangle
    cmap=cmap, # use the custom colormap
    vmin=-0.5, vmax=3.5, # set the limits of the color scale
    cbar_kws={
        'label': 'P-values', 'orientation': 'horizontal',
        'location': 'bottom', 'aspect': 40, 'pad': 0.1,
    }, # set the colorbar parameters
    xticklabels=headings, yticklabels=headings, # set the labels for the axes
)

colorbar = ax.collections[1].colorbar
# Adjust the colorbar ticks
colorbar.set_ticks([0, 1, 2, 3])

# Adjust the colorbar tick labels (raw strings avoid invalid escape warnings)
colorbar.set_ticklabels([
    r'$5\cdot 10^{-2} < p$',
    r'$10^{-2} < p < 5\cdot 10^{-2}$',
    r'$10^{-3} < p < 10^{-2}$',
    r'$p < 10^{-3}$'
])

# Place ticks and labels on all sides of the heatmap
ax.tick_params(top=True, labeltop=True, right=True, labelright=True)
fig.tight_layout(rect=[0, 0, .9, 1])
plt.show()
```
class=\"wp-image-12392\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/03\/output.png?w=709&amp;ssl=1 709w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/03\/output.png?resize=287%2C300&amp;ssl=1 287w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/03\/output.png?resize=624%2C653&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>We now have a plot where the lower triangle displays the p-values from the statistical test and the upper triangle displays the effect size from the statistical test, reducing information redundancy. For effect size, red indicates that the method on the y-axis has a higher MAE than the method on the x-axis; blue indicates the inverse. For p-values, green indicates a significant difference (p &lt; 0.05).<\/p>\n<\/div>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":118,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[621,189,227,278],"tags":[721,152],"ppma_author":[760],"class_list":["post-12390","post","type-post","status-publish","format-standard","hentry","category-data-visualization","category-machine-learning","category-python-code","category-statistics","tag-data-visualization","tag-python"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":760,"user_id":118,"is_guest":0,"slug":"sam","display_name":"Sam Money-Kyrle","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/784870e2ed5304f12f11366dad56cbf1c0b9aa63bd80021ae235ba5f30536a12?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12390","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/118"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=12390"}],"version-history":[{"count":3,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12390\/revisions"}],"predecessor-version":[{"id":12399,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12390\/revisions\/12399"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=12390"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=12390"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=12390"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=12390"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}