{"id":12062,"date":"2024-12-11T14:21:15","date_gmt":"2024-12-11T14:21:15","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=12062"},"modified":"2024-12-12T15:32:09","modified_gmt":"2024-12-12T15:32:09","slug":"visualising-and-validating-differences-between-machine-learning-models-on-small-benchmark-datasets","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2024\/12\/visualising-and-validating-differences-between-machine-learning-models-on-small-benchmark-datasets\/","title":{"rendered":"Visualising and validating differences between machine learning models on small benchmark datasets"},"content":{"rendered":"\n<title>Introduction<\/title>\n\n\n\n<div id=\"quarto-content\" class=\"page-columns page-rows-contents page-layout-article\">\n\n<main class=\"content\" id=\"quarto-document-content\">\n\n<header id=\"title-block-header\" class=\"quarto-title-block default\">\n<div class=\"quarto-title\">\n\n<\/div>\n\n\n\n<div class=\"quarto-title-meta\">\n\n    <div>\n    <div class=\"quarto-title-meta-heading\">Author<\/div>\n    <div class=\"quarto-title-meta-contents\">\n             <p>Sam Money-Kyrle <\/p>\n          <\/div>\n  <\/div>\n    \n  \n    \n  <\/div>\n  \n<h1 class=\"title\">Introduction<\/h1>\n\n<\/header>\n\n\n<p>An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. Additionally, it is rare to see convincing evidence, such as statistical tests, for whether one model is \u2018better\u2019 than another (something <a href=\"https:\/\/practicalcheminformatics.blogspot.com\/2024\/01\/ai-in-drug-discovery-2023-highly.html\">Pat Walters<\/a> has previously discussed). 
Tables are a practical way to present results and are appropriate in many cases; however, this practicality should not come at the cost of clarity.<\/p>\n<p>The terror of ugly tables extends to benchmark leaderboards, such as Therapeutic Data Commons (<a href=\"https:\/\/tdcommons.ai\/\">TDC<\/a>). These leaderboard tables do not show:<\/p>\n<ol type=\"1\">\n<li>whether differences in metrics between methods are statistically significant,<\/li>\n<li>whether methods use ensembles or single models,<\/li>\n<li>whether methods use classical (such as Morgan fingerprints) or learned (such as Graph Neural Networks) representations,<\/li>\n<li>whether methods are pre-trained or not,<\/li>\n<li>whether pre-trained models are supervised, self-supervised, or both,<\/li>\n<li>the data and tasks that pre-trained models are pre-trained on.<\/li>\n<\/ol>\n<p>This lack of context makes meaningful comparisons between approaches challenging, obscuring whether performance discrepancies are due to variance, ensembling, overfitting, exposure to more data, or novelties in model architecture and molecular featurisation. Confirming the statistical significance of performance differences (under consistent experimental conditions!) is crucial in constructing a more lucid picture of machine learning in drug discovery. Using figures to share results in a clear, non-tabular format would also help.<\/p>\n<p>Statistical validation is particularly relevant in domains with small datasets, such as drug discovery, as the small number of test samples leads to high variance in performance between different splits. Recent work by <a href=\"https:\/\/chemrxiv.org\/engage\/chemrxiv\/article-details\/672a91bd7be152b1d01a926b\">Ash <em>et al.<\/em> (2024)<\/a> sought to alleviate the lack of statistical validation in cheminformatics by sharing a helpful set of guidelines for researchers. 
Here, we explore implementing some of the methods they suggest (plus some others) in Python.<\/p>\n\n\n\n<!--more-->\n\n\n\n<title>Imports<\/title>\n\n\n<main class=\"content\" id=\"quarto-document-content\">\n\n<header id=\"title-block-header\" class=\"quarto-title-block default\">\n<div class=\"quarto-title\">\n<h1 class=\"title\">Imports<\/h1>\n<\/div>\n\n\n\n<div class=\"quarto-title-meta\">\n\n    \n  \n    \n  <\/div>\n  \n\n\n<\/header>\n\n\n<div id=\"f9e6a8a8-acf4-4d81-8f3f-17a30ef2139e\" class=\"cell\" data-execution_count=\"255\">\n<details class=\"code-fold\">\n<summary>Imports<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb1\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb1-1\"><a href=\"#cb1-1\" aria-hidden=\"true\"><\/a><span class=\"im\">import<\/span> lightgbm <span class=\"im\">as<\/span> lgb<\/span>\n<span id=\"cb1-2\"><a href=\"#cb1-2\" aria-hidden=\"true\"><\/a><span class=\"im\">import<\/span> matplotlib.pyplot <span class=\"im\">as<\/span> plt<\/span>\n<span id=\"cb1-3\"><a href=\"#cb1-3\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> matplotlib.colors <span class=\"im\">import<\/span> ListedColormap<\/span>\n<span id=\"cb1-4\"><a href=\"#cb1-4\" aria-hidden=\"true\"><\/a><span class=\"im\">import<\/span> numpy <span class=\"im\">as<\/span> np<\/span>\n<span id=\"cb1-5\"><a href=\"#cb1-5\" aria-hidden=\"true\"><\/a><span class=\"im\">import<\/span> seaborn <span class=\"im\">as<\/span> sns<\/span>\n<span id=\"cb1-6\"><a href=\"#cb1-6\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb1-7\"><a href=\"#cb1-7\" aria-hidden=\"true\"><\/a><span class=\"im\">import<\/span> pandas <span class=\"im\">as<\/span> pd<\/span>\n<span id=\"cb1-8\"><a href=\"#cb1-8\" aria-hidden=\"true\"><\/a><span class=\"im\">import<\/span> polaris <span class=\"im\">as<\/span> po<\/span>\n<span id=\"cb1-9\"><a href=\"#cb1-9\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb1-10\"><a href=\"#cb1-10\" 
aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> rdkit <span class=\"im\">import<\/span> Chem<\/span>\n<span id=\"cb1-11\"><a href=\"#cb1-11\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> rdkit.Chem <span class=\"im\">import<\/span> Descriptors<\/span>\n<span id=\"cb1-12\"><a href=\"#cb1-12\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> rdkit.DataStructs <span class=\"im\">import<\/span> TanimotoSimilarity<\/span>\n<span id=\"cb1-13\"><a href=\"#cb1-13\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> rdkit.ML.Cluster <span class=\"im\">import<\/span> Butina<\/span>\n<span id=\"cb1-14\"><a href=\"#cb1-14\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> rdkit.ML.Descriptors <span class=\"im\">import<\/span> MoleculeDescriptors<\/span>\n<span id=\"cb1-15\"><a href=\"#cb1-15\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> rdkit.Chem.rdFingerprintGenerator <span class=\"im\">import<\/span> GetMorganGenerator, GetMorganFeatureAtomInvGen<\/span>\n<span id=\"cb1-16\"><a href=\"#cb1-16\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb1-17\"><a href=\"#cb1-17\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> scipy <span class=\"im\">import<\/span> stats<\/span>\n<span id=\"cb1-18\"><a href=\"#cb1-18\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> sklearn.metrics <span class=\"im\">import<\/span> mean_absolute_error, r2_score<\/span>\n<span id=\"cb1-19\"><a href=\"#cb1-19\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb1-20\"><a href=\"#cb1-20\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> tqdm <span class=\"im\">import<\/span> tqdm<\/span>\n<span id=\"cb1-21\"><a href=\"#cb1-21\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb1-22\"><a href=\"#cb1-22\" aria-hidden=\"true\"><\/a><span class=\"im\">from<\/span> sklearn.model_selection <span class=\"im\">import<\/span> GroupKFold<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n<p><br> Below are the various 
packages, and their versions, used for this blog.<\/p>\n<div id=\"90fd2616-ca03-42f1-9237-974701ceb60a\" class=\"cell\" data-execution_count=\"315\">\n<div class=\"cell-output cell-output-display\" data-execution_count=\"315\">\n\n\n\n<table id=\"T_6cc2b\" class=\"caption-top table table-sm table-striped small\" data-quarto-postprocess=\"true\">\n<thead>\n<tr class=\"header\">\n<th id=\"T_6cc2b_level0_col0\" class=\"col_heading level0 col0\" data-quarto-table-cell-role=\"th\">Module<\/th>\n<th id=\"T_6cc2b_level0_col1\" class=\"col_heading level0 col1\" data-quarto-table-cell-role=\"th\">Version<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"odd\">\n<td id=\"T_6cc2b_row0_col0\" class=\"data row0 col0\">lightgbm<\/td>\n<td id=\"T_6cc2b_row0_col1\" class=\"data row0 col1\">4.5.0<\/td>\n<\/tr>\n<tr class=\"even\">\n<td id=\"T_6cc2b_row1_col0\" class=\"data row1 col0\">matplotlib<\/td>\n<td id=\"T_6cc2b_row1_col1\" class=\"data row1 col1\">3.9.2<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td id=\"T_6cc2b_row2_col0\" class=\"data row2 col0\">numpy<\/td>\n<td id=\"T_6cc2b_row2_col1\" class=\"data row2 col1\">1.26.4<\/td>\n<\/tr>\n<tr class=\"even\">\n<td id=\"T_6cc2b_row3_col0\" class=\"data row3 col0\">pandas<\/td>\n<td id=\"T_6cc2b_row3_col1\" class=\"data row3 col1\">2.2.3<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td id=\"T_6cc2b_row4_col0\" class=\"data row4 col0\">polaris<\/td>\n<td id=\"T_6cc2b_row4_col1\" class=\"data row4 col1\">0.8.6<\/td>\n<\/tr>\n<tr class=\"even\">\n<td id=\"T_6cc2b_row5_col0\" class=\"data row5 col0\">rdkit<\/td>\n<td id=\"T_6cc2b_row5_col1\" class=\"data row5 col1\">2024.03.5<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td id=\"T_6cc2b_row6_col0\" class=\"data row6 col0\">scipy<\/td>\n<td id=\"T_6cc2b_row6_col1\" class=\"data row6 col1\">1.14.1<\/td>\n<\/tr>\n<tr class=\"even\">\n<td id=\"T_6cc2b_row7_col0\" class=\"data row7 col0\">seaborn<\/td>\n<td id=\"T_6cc2b_row7_col1\" class=\"data row7 col1\">0.13.2<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td 
id=\"T_6cc2b_row8_col0\" class=\"data row8 col0\">sklearn<\/td>\n<td id=\"T_6cc2b_row8_col1\" class=\"data row8 col1\">1.6.0<\/td>\n<\/tr>\n<tr class=\"even\">\n<td id=\"T_6cc2b_row9_col0\" class=\"data row9 col0\">tqdm<\/td>\n<td id=\"T_6cc2b_row9_col1\" class=\"data row9 col1\">4.66.5<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<section id=\"dataset\" class=\"level3\">\n<h3 class=\"anchored\" data-anchor-id=\"dataset\">Dataset<\/h3>\n<p>The first thing we need is a benchmark dataset. Luckily, <a href=\"https:\/\/polarishub.io\/\">Polaris<\/a> has numerous benchmarking datasets for drug discovery-related tasks. First, we need to log in to Polaris (which you can do with a GitHub account), and then we need to choose a dataset. If you do not want to log in to Polaris, the original CSV for the dataset used here can be found in this <a href=\"https:\/\/github.com\/molecularinformatics\/Computational-ADME\/blob\/main\/ADME_public_set_3521.csv\">repository<\/a>.<\/p>\n<p>I opted to use a dataset from <a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/acs.jcim.3c00160\">Fang <em>et al.<\/em> (2023)<\/a> as it fulfills several criteria:<\/p>\n<ol type=\"1\">\n<li>the labels are all drug discovery-related molecular property prediction tasks,<\/li>\n<li>the data is non-federated, i.e., there is only one source (which should reduce noise),<\/li>\n<li>some benchmarks have very few examples, as is typical in drug discovery.<\/li>\n<\/ol>\n<div id=\"dbaa4373-7775-4bd4-853d-07288718e80d\" class=\"cell\" data-execution_count=\"7\">\n<details class=\"code-fold\">\n<summary>Polaris login<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb2\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb2-1\"><a href=\"#cb2-1\" aria-hidden=\"true\"><\/a>po.hub.client.PolarisHubClient().login()<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n<div id=\"df5c36ec-7ce5-4c7d-b053-bd8d07532d20\" class=\"cell\" 
data-execution_count=\"9\">\n<details open=\"\" class=\"code-fold\">\n<summary>Display dataset<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb1\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb1-1\"><a href=\"#cb1-1\" aria-hidden=\"true\"><\/a>df <span class=\"op\">=<\/span> dataset.table<\/span>\n<span id=\"cb1-2\"><a href=\"#cb1-2\" aria-hidden=\"true\"><\/a>df <span class=\"op\">=<\/span> df.drop(<span class=\"st\">'UNIQUE_ID'<\/span>, axis<span class=\"op\">=<\/span><span class=\"dv\">1<\/span>)<\/span>\n<span id=\"cb1-3\"><a href=\"#cb1-3\" aria-hidden=\"true\"><\/a>df.head()<\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-display\" data-execution_count=\"409\">\n<div>\n\n\n<table class=\"dataframe caption-top table table-sm table-striped small\" data-quarto-postprocess=\"true\" data-border=\"1\">\n<thead>\n<tr class=\"header\">\n<th data-quarto-table-cell-role=\"th\"><\/th>\n<th data-quarto-table-cell-role=\"th\">MOL_smiles<\/th>\n<th data-quarto-table-cell-role=\"th\">SMILES<\/th>\n<th data-quarto-table-cell-role=\"th\">LOG_HLM_CLint<\/th>\n<th data-quarto-table-cell-role=\"th\">LOG_RLM_CLint<\/th>\n<th data-quarto-table-cell-role=\"th\">LOG_MDR1-MDCK_ER<\/th>\n<th data-quarto-table-cell-role=\"th\">LOG_HPPB<\/th>\n<th data-quarto-table-cell-role=\"th\">LOG_RPPB<\/th>\n<th data-quarto-table-cell-role=\"th\">LOG_SOLUBILITY<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"odd\">\n<td data-quarto-table-cell-role=\"th\">0<\/td>\n<td>Brc1cc&#8230;<\/td>\n<td>Brc1cc&#8230;<\/td>\n<td>0.886265<\/td>\n<td>2.357933<\/td>\n<td>-0.247518<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>1.536432<\/td>\n<\/tr>\n<tr class=\"even\">\n<td data-quarto-table-cell-role=\"th\">1<\/td>\n<td>Brc1cc&#8230;<\/td>\n<td>Brc1cc&#8230;<\/td>\n<td>0.675687<\/td>\n<td>1.613704<\/td>\n<td>-0.010669<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>1.797475<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td 
data-quarto-table-cell-role=\"th\">2<\/td>\n<td>Brc1cn&#8230;<\/td>\n<td>Brc1cn&#8230;<\/td>\n<td>2.081607<\/td>\n<td>3.753651<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<\/tr>\n<tr class=\"even\">\n<td data-quarto-table-cell-role=\"th\">3<\/td>\n<td>Brc1cn&#8230;<\/td>\n<td>Brc1cn&#8230;<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>-0.033858<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td data-quarto-table-cell-role=\"th\">4<\/td>\n<td>Brc1nn&#8230;<\/td>\n<td>Brc1nn&#8230;<\/td>\n<td>1.888410<\/td>\n<td>3.492201<\/td>\n<td>-0.235024<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<td>NaN<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n\n<\/div>\n<\/div>\n<\/div>\n<p><br>The <code>MOL_smiles<\/code> column contains the molecular SMILES strings after standardisation. The <code>SMILES<\/code> column contains the original molecular SMILES strings.<\/p>\n<p>There are six endpoints in this dataset from <em>in vitro<\/em> experimental assays:<\/p>\n<ol type=\"1\">\n<li>LOG_HLM_CLint; a measure of drug clearance by human liver microsomes,<\/li>\n<li>LOG_RLM_CLint; a measure of drug clearance by rat liver microsomes,<\/li>\n<li>LOG_MDR1-MDCK_ER; a measure of active transport by overexpressed P-glycoprotein 1 (aka multidrug resistance protein 1 [MDR1]) in Madin-Darby canine kidney (MDCK) cells,<\/li>\n<li>LOG_HPPB; a measure of plasma protein binding in humans,<\/li>\n<li>LOG_RPPB; a measure of plasma protein binding in rats,<\/li>\n<li>LOG_SOLUBILITY; a measure of aqueous solubility.<\/li>\n<\/ol>\n<p>In total, there are 3521 molecules in this dataset:<\/p>\n<div id=\"0b881d4f-3833-4f18-8509-7971fb4b31d8\" class=\"cell\" data-execution_count=\"11\">\n<details class=\"code-fold\">\n<summary>Dataset size<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb5\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb5-1\"><a href=\"#cb5-1\" aria-hidden=\"true\"><\/a><span class=\"ss\">f'Dataset size: 
<\/span><span class=\"sc\">{<\/span>df<span class=\"sc\">.<\/span>shape<span class=\"sc\">}<\/span><span class=\"ss\">'<\/span><\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-display\" data-execution_count=\"11\">\n<pre><code>'Dataset size: (3521, 9)'<\/code><\/pre>\n<\/div>\n<\/div>\n<p>However, not every molecule has a value for every endpoint. Let\u2019s take a look at human plasma protein binding:<\/p>\n<div id=\"a62daaa9-a721-43db-b188-672f4fd04a86\" class=\"cell\" data-scrolled=\"true\" data-execution_count=\"12\">\n<details open=\"\" class=\"code-fold\">\n<summary>Select task<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb7\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb7-1\"><a href=\"#cb7-1\" aria-hidden=\"true\"><\/a>df <span class=\"op\">=<\/span> df[[<span class=\"st\">'MOL_smiles'<\/span>, <span class=\"st\">'LOG_HPPB'<\/span>]]<\/span>\n<span id=\"cb7-2\"><a href=\"#cb7-2\" aria-hidden=\"true\"><\/a>df <span class=\"op\">=<\/span> df.dropna()<\/span>\n<span id=\"cb7-3\"><a href=\"#cb7-3\" aria-hidden=\"true\"><\/a><span class=\"ss\">f'Human plasma protein binding dataset size: <\/span><span class=\"sc\">{<\/span>df<span class=\"sc\">.<\/span>shape<span class=\"sc\">}<\/span><span class=\"ss\">'<\/span><\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-display\" data-execution_count=\"12\">\n<pre><code>'Human plasma protein binding dataset size: (194, 2)'<\/code><\/pre>\n<\/div>\n<\/div>\n<p>There are only 194 molecules in this benchmark, meaning performance variance will likely be high between different test splits.<\/p>\n<\/section>\n<section id=\"featurisation\" class=\"level3\">\n<h3 class=\"anchored\" data-anchor-id=\"featurisation\">Featurisation<\/h3>\n<p>Now that we have a dataset of molecules and target labels, we need to featurise the molecules. 
First, let\u2019s convert the SMILES strings to RDKit molecules and extract the target labels:<\/p>\n<div id=\"f5dc9a7b-70ba-4510-bc38-2feb4aea8e8b\" class=\"cell\" data-execution_count=\"13\">\n<details open=\"\" class=\"code-fold\">\n<summary>Convert SMILES<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb9\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb9-1\"><a href=\"#cb9-1\" aria-hidden=\"true\"><\/a>mols <span class=\"op\">=<\/span> [Chem.MolFromSmiles(i) <span class=\"cf\">for<\/span> i <span class=\"kw\">in<\/span> df.MOL_smiles]<\/span>\n<span id=\"cb9-2\"><a href=\"#cb9-2\" aria-hidden=\"true\"><\/a>y_true <span class=\"op\">=<\/span> df.LOG_HPPB.to_numpy()<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n<p><br>Three commonly used molecular featurisation methods are physicochemical descriptor vectors (PDV), extended-connectivity fingerprints (ECFP), and functional class fingerprints (FCFP). PDVs are vectors containing numerical molecular features, such as molecular weight and total charge. ECFPs are bit vector representations of 2D molecular topology, where 1 indicates the presence of a substructure, and 0 indicates the absence of a substructure. FCFPs are a variant of ECFPs that group atoms by certain features (e.g., all halogen atoms are labelled the same). 
See <a href=\"https:\/\/doi.org\/10.1021\/ci100050t\">Rogers and Hahn (2010)<\/a> and <a href=\"https:\/\/doi.org\/10.1093\/bib\/bbad422\">McGibbon <em>et al.<\/em> (2024)<\/a> for more.<\/p>\n<p>First, let\u2019s generate PDVs for each molecule with RDKit:<\/p>\n<div id=\"d0e49e70-3587-4f33-abec-3de0748780ac\" class=\"cell\" data-execution_count=\"79\">\n<details open=\"\" class=\"code-fold\">\n<summary>PDV descriptors list<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb10\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb10-1\"><a href=\"#cb10-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># List of RDKit molecular descriptors<\/span><\/span>\n<span id=\"cb10-2\"><a href=\"#cb10-2\" aria-hidden=\"true\"><\/a>descriptor_list <span class=\"op\">=<\/span> [<\/span>\n<span id=\"cb10-3\"><a href=\"#cb10-3\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'BalabanJ'<\/span>, <span class=\"st\">'BertzCT'<\/span>, <span class=\"st\">'Chi0'<\/span>, <span class=\"st\">'Chi0n'<\/span>, <span class=\"st\">'Chi0v'<\/span>, <span class=\"st\">'Chi1'<\/span>, <span class=\"st\">'Chi1n'<\/span>, <span class=\"st\">'Chi1v'<\/span>, <span class=\"st\">'Chi2n'<\/span>, <span class=\"st\">'Chi2v'<\/span>,<\/span>\n<span id=\"cb10-4\"><a href=\"#cb10-4\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'Chi3n'<\/span>, <span class=\"st\">'Chi3v'<\/span>, <span class=\"st\">'Chi4n'<\/span>, <span class=\"st\">'Chi4v'<\/span>, <span class=\"st\">'EState_VSA1'<\/span>, <span class=\"st\">'EState_VSA10'<\/span>, <span class=\"st\">'EState_VSA11'<\/span>,<\/span>\n<span id=\"cb10-5\"><a href=\"#cb10-5\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'EState_VSA2'<\/span>, <span class=\"st\">'EState_VSA3'<\/span>, <span class=\"st\">'EState_VSA4'<\/span>, <span class=\"st\">'EState_VSA5'<\/span>, <span class=\"st\">'EState_VSA6'<\/span>, <span class=\"st\">'EState_VSA7'<\/span>,<\/span>\n<span id=\"cb10-6\"><a href=\"#cb10-6\" 
aria-hidden=\"true\"><\/a>    <span class=\"st\">'EState_VSA8'<\/span>, <span class=\"st\">'EState_VSA9'<\/span>, <span class=\"st\">'ExactMolWt'<\/span>, <span class=\"st\">'FractionCSP3'<\/span>, <span class=\"st\">'HallKierAlpha'<\/span>, <span class=\"st\">'HeavyAtomCount'<\/span>, <span class=\"st\">'HeavyAtomMolWt'<\/span>,<\/span>\n<span id=\"cb10-7\"><a href=\"#cb10-7\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'Ipc'<\/span>, <span class=\"st\">'Kappa1'<\/span>, <span class=\"st\">'Kappa2'<\/span>, <span class=\"st\">'Kappa3'<\/span>, <span class=\"st\">'LabuteASA'<\/span>, <span class=\"st\">'MaxAbsEStateIndex'<\/span>, <span class=\"st\">'MaxAbsPartialCharge'<\/span>,<\/span>\n<span id=\"cb10-8\"><a href=\"#cb10-8\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'MaxEStateIndex'<\/span>, <span class=\"st\">'MaxPartialCharge'<\/span>, <span class=\"st\">'MinAbsEStateIndex'<\/span>, <span class=\"st\">'MinAbsPartialCharge'<\/span>,<\/span>\n<span id=\"cb10-9\"><a href=\"#cb10-9\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'MinEStateIndex'<\/span>, <span class=\"st\">'MinPartialCharge'<\/span>, <span class=\"st\">'MolLogP'<\/span>, <span class=\"st\">'MolMR'<\/span>, <span class=\"st\">'MolWt'<\/span>, <span class=\"st\">'NHOHCount'<\/span>, <span class=\"st\">'NOCount'<\/span>,<\/span>\n<span id=\"cb10-10\"><a href=\"#cb10-10\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'NumAliphaticCarbocycles'<\/span>, <span class=\"st\">'NumAliphaticHeterocycles'<\/span>, <span class=\"st\">'NumAliphaticRings'<\/span>,<\/span>\n<span id=\"cb10-11\"><a href=\"#cb10-11\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'NumAromaticCarbocycles'<\/span>, <span class=\"st\">'NumAromaticHeterocycles'<\/span>, <span class=\"st\">'NumAromaticRings'<\/span>, <span class=\"st\">'NumHAcceptors'<\/span>,<\/span>\n<span id=\"cb10-12\"><a href=\"#cb10-12\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'NumHDonors'<\/span>, <span 
class=\"st\">'NumHeteroatoms'<\/span>, <span class=\"st\">'NumRadicalElectrons'<\/span>, <span class=\"st\">'NumRotatableBonds'<\/span>,<\/span>\n<span id=\"cb10-13\"><a href=\"#cb10-13\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'NumSaturatedCarbocycles'<\/span>, <span class=\"st\">'NumSaturatedHeterocycles'<\/span>, <span class=\"st\">'NumSaturatedRings'<\/span>,<\/span>\n<span id=\"cb10-14\"><a href=\"#cb10-14\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'NumValenceElectrons'<\/span>, <span class=\"st\">'PEOE_VSA1'<\/span>, <span class=\"st\">'PEOE_VSA10'<\/span>, <span class=\"st\">'PEOE_VSA11'<\/span>, <span class=\"st\">'PEOE_VSA12'<\/span>, <span class=\"st\">'PEOE_VSA13'<\/span>,<\/span>\n<span id=\"cb10-15\"><a href=\"#cb10-15\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'PEOE_VSA14'<\/span>, <span class=\"st\">'PEOE_VSA2'<\/span>, <span class=\"st\">'PEOE_VSA3'<\/span>, <span class=\"st\">'PEOE_VSA4'<\/span>, <span class=\"st\">'PEOE_VSA5'<\/span>, <span class=\"st\">'PEOE_VSA6'<\/span>, <span class=\"st\">'PEOE_VSA7'<\/span>,<\/span>\n<span id=\"cb10-16\"><a href=\"#cb10-16\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'PEOE_VSA8'<\/span>, <span class=\"st\">'PEOE_VSA9'<\/span>, <span class=\"st\">'RingCount'<\/span>, <span class=\"st\">'SMR_VSA1'<\/span>, <span class=\"st\">'SMR_VSA10'<\/span>, <span class=\"st\">'SMR_VSA2'<\/span>, <span class=\"st\">'SMR_VSA3'<\/span>,<\/span>\n<span id=\"cb10-17\"><a href=\"#cb10-17\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'SMR_VSA4'<\/span>, <span class=\"st\">'SMR_VSA5'<\/span>, <span class=\"st\">'SMR_VSA6'<\/span>, <span class=\"st\">'SMR_VSA7'<\/span>, <span class=\"st\">'SMR_VSA8'<\/span>, <span class=\"st\">'SMR_VSA9'<\/span>, <span class=\"st\">'SlogP_VSA1'<\/span>,<\/span>\n<span id=\"cb10-18\"><a href=\"#cb10-18\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'SlogP_VSA10'<\/span>, <span class=\"st\">'SlogP_VSA11'<\/span>, <span class=\"st\">'SlogP_VSA12'<\/span>, <span 
class=\"st\">'SlogP_VSA2'<\/span>, <span class=\"st\">'SlogP_VSA3'<\/span>, <span class=\"st\">'SlogP_VSA4'<\/span>,<\/span>\n<span id=\"cb10-19\"><a href=\"#cb10-19\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'SlogP_VSA5'<\/span>, <span class=\"st\">'SlogP_VSA6'<\/span>, <span class=\"st\">'SlogP_VSA7'<\/span>, <span class=\"st\">'SlogP_VSA8'<\/span>, <span class=\"st\">'SlogP_VSA9'<\/span>, <span class=\"st\">'TPSA'<\/span>, <span class=\"st\">'VSA_EState1'<\/span>,<\/span>\n<span id=\"cb10-20\"><a href=\"#cb10-20\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'VSA_EState10'<\/span>, <span class=\"st\">'VSA_EState2'<\/span>, <span class=\"st\">'VSA_EState3'<\/span>, <span class=\"st\">'VSA_EState4'<\/span>, <span class=\"st\">'VSA_EState5'<\/span>, <span class=\"st\">'VSA_EState6'<\/span>,<\/span>\n<span id=\"cb10-21\"><a href=\"#cb10-21\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'VSA_EState7'<\/span>, <span class=\"st\">'VSA_EState8'<\/span>, <span class=\"st\">'VSA_EState9'<\/span>, <span class=\"st\">'fr_Al_COO'<\/span>, <span class=\"st\">'fr_Al_OH'<\/span>, <span class=\"st\">'fr_Al_OH_noTert'<\/span>,<\/span>\n<span id=\"cb10-22\"><a href=\"#cb10-22\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_ArN'<\/span>, <span class=\"st\">'fr_Ar_COO'<\/span>, <span class=\"st\">'fr_Ar_N'<\/span>, <span class=\"st\">'fr_Ar_NH'<\/span>, <span class=\"st\">'fr_Ar_OH'<\/span>, <span class=\"st\">'fr_COO'<\/span>, <span class=\"st\">'fr_COO2'<\/span>, <span class=\"st\">'fr_C_O'<\/span>,<\/span>\n<span id=\"cb10-23\"><a href=\"#cb10-23\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_C_O_noCOO'<\/span>, <span class=\"st\">'fr_C_S'<\/span>, <span class=\"st\">'fr_HOCCN'<\/span>, <span class=\"st\">'fr_Imine'<\/span>, <span class=\"st\">'fr_NH0'<\/span>, <span class=\"st\">'fr_NH1'<\/span>, <span class=\"st\">'fr_NH2'<\/span>, <span class=\"st\">'fr_N_O'<\/span>,<\/span>\n<span id=\"cb10-24\"><a href=\"#cb10-24\" aria-hidden=\"true\"><\/a>    
<span class=\"st\">'fr_Ndealkylation1'<\/span>, <span class=\"st\">'fr_Ndealkylation2'<\/span>, <span class=\"st\">'fr_Nhpyrrole'<\/span>, <span class=\"st\">'fr_SH'<\/span>, <span class=\"st\">'fr_aldehyde'<\/span>,<\/span>\n<span id=\"cb10-25\"><a href=\"#cb10-25\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_alkyl_carbamate'<\/span>, <span class=\"st\">'fr_alkyl_halide'<\/span>, <span class=\"st\">'fr_allylic_oxid'<\/span>, <span class=\"st\">'fr_amide'<\/span>, <span class=\"st\">'fr_amidine'<\/span>,<\/span>\n<span id=\"cb10-26\"><a href=\"#cb10-26\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_aniline'<\/span>, <span class=\"st\">'fr_aryl_methyl'<\/span>, <span class=\"st\">'fr_azide'<\/span>, <span class=\"st\">'fr_azo'<\/span>, <span class=\"st\">'fr_barbitur'<\/span>, <span class=\"st\">'fr_benzene'<\/span>,<\/span>\n<span id=\"cb10-27\"><a href=\"#cb10-27\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_benzodiazepine'<\/span>, <span class=\"st\">'fr_bicyclic'<\/span>, <span class=\"st\">'fr_diazo'<\/span>, <span class=\"st\">'fr_dihydropyridine'<\/span>, <span class=\"st\">'fr_epoxide'<\/span>,<\/span>\n<span id=\"cb10-28\"><a href=\"#cb10-28\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_ester'<\/span>, <span class=\"st\">'fr_ether'<\/span>, <span class=\"st\">'fr_furan'<\/span>, <span class=\"st\">'fr_guanido'<\/span>, <span class=\"st\">'fr_halogen'<\/span>, <span class=\"st\">'fr_hdrzine'<\/span>,<\/span>\n<span id=\"cb10-29\"><a href=\"#cb10-29\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_hdrzone'<\/span>, <span class=\"st\">'fr_imidazole'<\/span>, <span class=\"st\">'fr_imide'<\/span>, <span class=\"st\">'fr_isocyan'<\/span>, <span class=\"st\">'fr_isothiocyan'<\/span>, <span class=\"st\">'fr_ketone'<\/span>,<\/span>\n<span id=\"cb10-30\"><a href=\"#cb10-30\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_ketone_Topliss'<\/span>, <span class=\"st\">'fr_lactam'<\/span>, <span 
class=\"st\">'fr_lactone'<\/span>, <span class=\"st\">'fr_methoxy'<\/span>, <span class=\"st\">'fr_morpholine'<\/span>, <span class=\"st\">'fr_nitrile'<\/span>,<\/span>\n<span id=\"cb10-31\"><a href=\"#cb10-31\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_nitro'<\/span>, <span class=\"st\">'fr_nitro_arom'<\/span>, <span class=\"st\">'fr_nitro_arom_nonortho'<\/span>, <span class=\"st\">'fr_nitroso'<\/span>, <span class=\"st\">'fr_oxazole'<\/span>,<\/span>\n<span id=\"cb10-32\"><a href=\"#cb10-32\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_oxime'<\/span>, <span class=\"st\">'fr_para_hydroxylation'<\/span>, <span class=\"st\">'fr_phenol'<\/span>, <span class=\"st\">'fr_phenol_noOrthoHbond'<\/span>, <span class=\"st\">'fr_phos_acid'<\/span>,<\/span>\n<span id=\"cb10-33\"><a href=\"#cb10-33\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_phos_ester'<\/span>, <span class=\"st\">'fr_piperdine'<\/span>, <span class=\"st\">'fr_piperzine'<\/span>, <span class=\"st\">'fr_priamide'<\/span>, <span class=\"st\">'fr_prisulfonamd'<\/span>,<\/span>\n<span id=\"cb10-34\"><a href=\"#cb10-34\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_pyridine'<\/span>, <span class=\"st\">'fr_quatN'<\/span>, <span class=\"st\">'fr_sulfide'<\/span>, <span class=\"st\">'fr_sulfonamd'<\/span>, <span class=\"st\">'fr_sulfone'<\/span>, <span class=\"st\">'fr_term_acetylene'<\/span>,<\/span>\n<span id=\"cb10-35\"><a href=\"#cb10-35\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'fr_tetrazole'<\/span>, <span class=\"st\">'fr_thiazole'<\/span>, <span class=\"st\">'fr_thiocyan'<\/span>, <span class=\"st\">'fr_thiophene'<\/span>, <span class=\"st\">'fr_unbrch_alkane'<\/span>, <span class=\"st\">'fr_urea'<\/span>, <span class=\"st\">'qed'<\/span><\/span>\n<span id=\"cb10-36\"><a href=\"#cb10-36\" aria-hidden=\"true\"><\/a>]<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n<div id=\"1bd292ae-2491-4277-a510-ff9299d0ac4d\" class=\"cell\" 
data-execution_count=\"15\">\n<details open=\"\" class=\"code-fold\">\n<summary>PDV generation<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb11\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb11-1\"><a href=\"#cb11-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># Physicochemical descriptor vector generation<\/span><\/span>\n<span id=\"cb11-2\"><a href=\"#cb11-2\" aria-hidden=\"true\"><\/a>desc_gen <span class=\"op\">=<\/span> MoleculeDescriptors.MolecularDescriptorCalculator(descriptor_list)<\/span>\n<span id=\"cb11-3\"><a href=\"#cb11-3\" aria-hidden=\"true\"><\/a>pdvs <span class=\"op\">=<\/span> np.array([desc_gen.CalcDescriptors(m) <span class=\"cf\">for<\/span> m <span class=\"kw\">in<\/span> mols])<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n\n\n<\/main>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>Next, let\u2019s generate ECFP fingerprints with a maximum substructure radius of 2 and bit length of 1024 (ECFP4<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=_%7B1024%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"_{1024}\" class=\"latex\" \/>)<\/p>\n<\/div>\n\n\n\n<div id=\"45c138d3-15b8-4a93-9092-a9d73b8195d5\" class=\"cell\" data-scrolled=\"true\" data-execution_count=\"342\">\n<details open=\"\" class=\"code-fold\">\n<summary>ECFP generation<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb12\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb12-1\"><a href=\"#cb12-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># Morgan fingerprint generation<\/span><\/span>\n<span id=\"cb12-2\"><a href=\"#cb12-2\" aria-hidden=\"true\"><\/a>ecfp_gen <span class=\"op\">=<\/span> GetMorganGenerator(radius<span class=\"op\">=<\/span><span class=\"dv\">2<\/span>, fpSize<span class=\"op\">=<\/span><span class=\"dv\">1024<\/span>, includeChirality<span class=\"op\">=<\/span><span 
class=\"va\">True<\/span>)<\/span>\n<span id=\"cb12-3\"><a href=\"#cb12-3\" aria-hidden=\"true\"><\/a>fps <span class=\"op\">=<\/span> np.array([ecfp_gen.GetFingerprintAsNumPy(m) <span class=\"cf\">for<\/span> m <span class=\"kw\">in<\/span> mols])<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>Now, let\u2019s generate FCFP fingerprints with a maximum substructure radius of 2 and bit length of 1024 (FCFP4<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=_%7B1024%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"_{1024}\" class=\"latex\" \/>)<\/p>\n<\/div>\n\n\n\n<div id=\"85ce9e8f-adfa-455a-8d49-3ea6f2f2797d\" class=\"cell\" data-scrolled=\"true\" data-execution_count=\"107\">\n<details open=\"\" class=\"code-fold\">\n<summary>FCFP generation<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb13\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb13-1\"><a href=\"#cb13-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># Morgan functional fingerprint generation<\/span><\/span>\n<span id=\"cb13-2\"><a href=\"#cb13-2\" aria-hidden=\"true\"><\/a>fcfp_gen <span class=\"op\">=<\/span> GetMorganGenerator(radius<span class=\"op\">=<\/span><span class=\"dv\">2<\/span>, fpSize<span class=\"op\">=<\/span><span class=\"dv\">1024<\/span>, includeChirality<span class=\"op\">=<\/span><span class=\"va\">True<\/span>, atomInvariantsGenerator<span class=\"op\">=<\/span>GetMorganFeatureAtomInvGen())<\/span>\n<span id=\"cb13-3\"><a href=\"#cb13-3\" aria-hidden=\"true\"><\/a>func_fps <span class=\"op\">=<\/span> np.array([fcfp_gen.GetFingerprintAsNumPy(m) <span class=\"cf\">for<\/span> m <span class=\"kw\">in<\/span> mols])<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n<p><br>Due to the small number of dataset samples, we should apply feature selection and dimensionality reduction to each featurisation method on training set data. 
However, for brevity, this step will be excluded.<\/p>\n<\/section>\n<section id=\"splitting\" class=\"level3\">\n<h3 class=\"anchored\" data-anchor-id=\"splitting\">Splitting<\/h3>\n<p>Now, we need to work out how to split our data. To compare generalisability between methods, we need to minimise data leakage between training and testing splits; this is achievable through implementing a <a href=\"https:\/\/proceedings.mlr.press\/v202\/klarner23a.html\">covariate shift<\/a>, i.e., ensuring the training set and testing set molecules are dissimilar. There are several methods for splitting molecular data into hypothetically dissimilar subsets, including <a href=\"https:\/\/doi.org\/10.1021\/jm9602928\">scaffold splitting<\/a>, <a href=\"https:\/\/doi.org\/10.1021\/ci9803381\">Butina clustering<\/a>, and <a href=\"https:\/\/doi.org\/10.1007\/s11222-007-9033-z\">spectral clustering<\/a>. Here, we look at Butina clustering, which is implementable in RDKit.<\/p>\n<p>First, we must calculate the pairwise <a href=\"https:\/\/doi.org\/10.1021\/ci300261r\">Tanimoto similarity<\/a> (aka Jaccard index) over our benchmark molecules. Tanimoto similarity is a method for calculating the bit overlap between binary vectors and is commonly applied to ECFP fingerprints to measure similarity between molecules. The maximum value of Tanimoto similarity is 1, indicating the bit vectors are identical, and the minimum value is 0, indicating the bit vectors possess no common on-bits. 
Pairwise calculation of Tanimoto similarity for a small number of molecules is easily implementable in Python with NumPy:<\/p>\n<div id=\"cf54f578-8e55-4710-8a7d-7d9117c1395e\" class=\"cell\" data-execution_count=\"388\">\n<details open=\"\" class=\"code-fold\">\n<summary>Tanimoto similarity<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb14\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb14-1\"><a href=\"#cb14-1\" aria-hidden=\"true\"><\/a><span class=\"kw\">def<\/span> tanimoto_pairwise(fps: np.ndarray) <span class=\"op\">-&gt;<\/span> np.ndarray:<\/span>\n<span id=\"cb14-2\"><a href=\"#cb14-2\" aria-hidden=\"true\"><\/a>    <span class=\"co\">\"\"\"<\/span><\/span>\n<span id=\"cb14-3\"><a href=\"#cb14-3\" aria-hidden=\"true\"><\/a><span class=\"co\">    Compute the pairwise Tanimoto similarity for a set of fingerprints.<\/span><\/span>\n<span id=\"cb14-4\"><a href=\"#cb14-4\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb14-5\"><a href=\"#cb14-5\" aria-hidden=\"true\"><\/a><span class=\"co\">    Parameters:<\/span><\/span>\n<span id=\"cb14-6\"><a href=\"#cb14-6\" aria-hidden=\"true\"><\/a><span class=\"co\">    fps (np.ndarray)<\/span><\/span>\n<span id=\"cb14-7\"><a href=\"#cb14-7\" aria-hidden=\"true\"><\/a><span class=\"co\">        A 2D numpy array where each row represents a binary fingerprint.<\/span><\/span>\n<span id=\"cb14-8\"><a href=\"#cb14-8\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb14-9\"><a href=\"#cb14-9\" aria-hidden=\"true\"><\/a><span class=\"co\">    Returns:<\/span><\/span>\n<span id=\"cb14-10\"><a href=\"#cb14-10\" aria-hidden=\"true\"><\/a><span class=\"co\">    np.ndarray<\/span><\/span>\n<span id=\"cb14-11\"><a href=\"#cb14-11\" aria-hidden=\"true\"><\/a><span class=\"co\">        A 2D numpy array containing the Tanimoto similarity coefficients <\/span><\/span>\n<span id=\"cb14-12\"><a href=\"#cb14-12\" aria-hidden=\"true\"><\/a><span class=\"co\">        for each pair of 
fingerprints. The element at position (i, j) in<\/span><\/span>\n<span id=\"cb14-13\"><a href=\"#cb14-13\" aria-hidden=\"true\"><\/a><span class=\"co\">        the output array represents the Tanimoto similarity coefficient<\/span><\/span>\n<span id=\"cb14-14\"><a href=\"#cb14-14\" aria-hidden=\"true\"><\/a><span class=\"co\">        between the i-th and j-th fingerprints in the input array.<\/span><\/span>\n<span id=\"cb14-15\"><a href=\"#cb14-15\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb14-16\"><a href=\"#cb14-16\" aria-hidden=\"true\"><\/a><span class=\"co\">    See:<\/span><\/span>\n<span id=\"cb14-17\"><a href=\"#cb14-17\" aria-hidden=\"true\"><\/a><span class=\"co\">        https:\/\/doi.org\/10.1021\/ci300261r<\/span><\/span>\n<span id=\"cb14-18\"><a href=\"#cb14-18\" aria-hidden=\"true\"><\/a><span class=\"co\">        https:\/\/doi.org\/10.1111\/j.1469-8137.1912.tb05611.x<\/span><\/span>\n<span id=\"cb14-19\"><a href=\"#cb14-19\" aria-hidden=\"true\"><\/a><span class=\"co\">    \"\"\"<\/span><\/span>\n<span id=\"cb14-20\"><a href=\"#cb14-20\" aria-hidden=\"true\"><\/a>    c <span class=\"op\">=<\/span> np.matmul(fps, fps.T)<\/span>\n<span id=\"cb14-21\"><a href=\"#cb14-21\" aria-hidden=\"true\"><\/a>    ab <span class=\"op\">=<\/span> np.diag(c)<\/span>\n<span id=\"cb14-22\"><a href=\"#cb14-22\" aria-hidden=\"true\"><\/a>    ab <span class=\"op\">=<\/span> np.add.outer(ab, ab)<\/span>\n<span id=\"cb14-23\"><a href=\"#cb14-23\" aria-hidden=\"true\"><\/a>    <span class=\"cf\">return<\/span> c <span class=\"op\">\/<\/span> (ab <span class=\"op\">-<\/span> c)<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n<p><br>We can cluster the benchmark molecules based on Tanimoto similarity between molecules. 
Here, <code>Butina.ClusterData<\/code> takes as input the flattened lower triangle of the pairwise distance matrix, a distance threshold within which two molecules are considered neighbours, a boolean indicating that the input data are precomputed distances, and the number of molecules to cluster:<\/p>\n<div id=\"1eddd5d3-785d-4f9d-b9ca-8efae0e603ff\" class=\"cell\" data-execution_count=\"340\">\n<details open=\"\" class=\"code-fold\">\n<summary>Butina clustering<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb15\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb15-1\"><a href=\"#cb15-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># get pairwise distances based on Tanimoto similarity<\/span><\/span>\n<span id=\"cb15-2\"><a href=\"#cb15-2\" aria-hidden=\"true\"><\/a>dist_matrix <span class=\"op\">=<\/span> <span class=\"dv\">1<\/span> <span class=\"op\">-<\/span> tanimoto_pairwise(fps)<\/span>\n<span id=\"cb15-3\"><a href=\"#cb15-3\" aria-hidden=\"true\"><\/a><span class=\"co\"># flatten lower triangle of symmetric matrix<\/span><\/span>\n<span id=\"cb15-4\"><a href=\"#cb15-4\" aria-hidden=\"true\"><\/a>dist_data <span class=\"op\">=<\/span> dist_matrix[np.tril_indices(<span class=\"bu\">len<\/span>(dist_matrix), <span class=\"op\">-<\/span><span class=\"dv\">1<\/span>)]<\/span>\n<span id=\"cb15-5\"><a href=\"#cb15-5\" aria-hidden=\"true\"><\/a><span class=\"co\"># butina clustering<\/span><\/span>\n<span id=\"cb15-6\"><a href=\"#cb15-6\" aria-hidden=\"true\"><\/a>clusters <span class=\"op\">=<\/span> Butina.ClusterData(<\/span>\n<span id=\"cb15-7\"><a href=\"#cb15-7\" aria-hidden=\"true\"><\/a>    dist_data, distThresh<span class=\"op\">=<\/span><span class=\"fl\">0.65<\/span>, isDistData<span class=\"op\">=<\/span><span class=\"va\">True<\/span>, nPts<span class=\"op\">=<\/span><span class=\"bu\">len<\/span>(dist_matrix)<\/span>\n<span id=\"cb15-8\"><a href=\"#cb15-8\" aria-hidden=\"true\"><\/a>)<\/span>\n<span 
id=\"cb15-9\"><a href=\"#cb15-9\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb15-10\"><a href=\"#cb15-10\" aria-hidden=\"true\"><\/a><span class=\"co\"># map each molecule to its integer cluster label<\/span><\/span>\n<span id=\"cb15-11\"><a href=\"#cb15-11\" aria-hidden=\"true\"><\/a>butina_groups <span class=\"op\">=<\/span> np.zeros(<span class=\"bu\">len<\/span>(fps), dtype<span class=\"op\">=<\/span><span class=\"bu\">int<\/span>)<\/span>\n<span id=\"cb15-12\"><a href=\"#cb15-12\" aria-hidden=\"true\"><\/a><span class=\"cf\">for<\/span> i <span class=\"kw\">in<\/span> <span class=\"bu\">range<\/span>(<span class=\"bu\">len<\/span>(clusters)):<\/span>\n<span id=\"cb15-13\"><a href=\"#cb15-13\" aria-hidden=\"true\"><\/a>    <span class=\"cf\">for<\/span> j <span class=\"kw\">in<\/span> clusters[i]:<\/span>\n<span id=\"cb15-14\"><a href=\"#cb15-14\" aria-hidden=\"true\"><\/a>        butina_groups[j] <span class=\"op\">=<\/span> i<\/span>\n<span id=\"cb15-15\"><a href=\"#cb15-15\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb15-16\"><a href=\"#cb15-16\" aria-hidden=\"true\"><\/a><span class=\"ss\">f'Number of clusters: <\/span><span class=\"sc\">{<\/span><span class=\"bu\">len<\/span>(clusters)<span class=\"sc\">}<\/span><span class=\"ss\">'<\/span><\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-display\" data-execution_count=\"340\">\n<pre><code>'Number of clusters: 126'<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We have molecules clustered by topological similarity. Now, we need a method for splitting our data given the clusters. Enter <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.GroupKFold.html\">GroupKFold<\/a>. 
Given a set of groups, GroupKFold produces K-Fold cross-validation train and test sets where splits are based on groups (i.e., clusters) rather than individual data points:<\/p>\n<\/section>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"388\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?resize=625%2C388&#038;ssl=1\" alt=\"\" class=\"wp-image-12064\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?resize=1024%2C635&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?resize=300%2C186&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?resize=768%2C477&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?resize=1536%2C953&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?resize=624%2C387&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?w=1834&amp;ssl=1 1834w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/groupKFold.png?w=1250&amp;ssl=1 1250w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<main class=\"content\" id=\"quarto-document-content\">\n\n\n\n\n<p>We can use GroupKFold to split our molecular data based on Butina clusters. This approach can also be applied to scaffold groupings and spectral clusters. Additionally, random seeding and group shuffling were added to GroupKFold in version 1.6 of sklearn, so cross-validation can be repeated to generate many different train-test splits. 
For example:<\/p>\n<div id=\"92a06d8d-e058-41d8-8b61-5ef0c3bf8591\" class=\"cell\" data-execution_count=\"317\">\n<details open=\"\" class=\"code-fold\">\n<summary>GroupKFold<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb1\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb1-1\"><a href=\"#cb1-1\" aria-hidden=\"true\"><\/a><span class=\"bu\">print<\/span>(<span class=\"ss\">f'Butina clusters for first 10 molecules: <\/span><span class=\"sc\">{<\/span>butina_groups[:<span class=\"dv\">10<\/span>]<span class=\"sc\">}<\/span><span class=\"ss\">'<\/span>)<\/span>\n<span id=\"cb1-2\"><a href=\"#cb1-2\" aria-hidden=\"true\"><\/a><span class=\"cf\">for<\/span> repeat <span class=\"kw\">in<\/span> <span class=\"bu\">range<\/span>(<span class=\"dv\">2<\/span>):<\/span>\n<span id=\"cb1-3\"><a href=\"#cb1-3\" aria-hidden=\"true\"><\/a>    folds <span class=\"op\">=<\/span> <span class=\"dv\">2<\/span><\/span>\n<span id=\"cb1-4\"><a href=\"#cb1-4\" aria-hidden=\"true\"><\/a>    splitter <span class=\"op\">=<\/span> GroupKFold(n_splits<span class=\"op\">=<\/span>folds, shuffle<span class=\"op\">=<\/span><span class=\"va\">True<\/span>, random_state<span class=\"op\">=<\/span>repeat)<\/span>\n<span id=\"cb1-5\"><a href=\"#cb1-5\" aria-hidden=\"true\"><\/a>    <span class=\"bu\">print<\/span>(<span class=\"ss\">f'Repetition <\/span><span class=\"sc\">{<\/span>repeat<span class=\"sc\">}<\/span><span class=\"ss\">:'<\/span>)<\/span>\n<span id=\"cb1-6\"><a href=\"#cb1-6\" aria-hidden=\"true\"><\/a>    <span class=\"cf\">for<\/span> num, fold <span class=\"kw\">in<\/span> <span class=\"bu\">enumerate<\/span>(splitter.split(fps[:<span class=\"dv\">10<\/span>], groups<span class=\"op\">=<\/span>butina_groups[:<span class=\"dv\">10<\/span>])):<\/span>\n<span id=\"cb1-7\"><a href=\"#cb1-7\" aria-hidden=\"true\"><\/a>        <span class=\"bu\">print<\/span>(<span class=\"ss\">f'Fold <\/span><span class=\"sc\">{<\/span>num<span 
class=\"sc\">}<\/span><span class=\"ss\">:'<\/span>)<\/span>\n<span id=\"cb1-8\"><a href=\"#cb1-8\" aria-hidden=\"true\"><\/a>        train_idx, test_idx <span class=\"op\">=<\/span> fold<\/span>\n<span id=\"cb1-9\"><a href=\"#cb1-9\" aria-hidden=\"true\"><\/a>        train_groups, test_groups <span class=\"op\">=<\/span> [], []<\/span>\n<span id=\"cb1-10\"><a href=\"#cb1-10\" aria-hidden=\"true\"><\/a>        <span class=\"cf\">for<\/span> i <span class=\"kw\">in<\/span> train_idx:<\/span>\n<span id=\"cb1-11\"><a href=\"#cb1-11\" aria-hidden=\"true\"><\/a>            train_groups.append(butina_groups[i])<\/span>\n<span id=\"cb1-12\"><a href=\"#cb1-12\" aria-hidden=\"true\"><\/a>        <span class=\"cf\">for<\/span> i <span class=\"kw\">in<\/span> test_idx:<\/span>\n<span id=\"cb1-13\"><a href=\"#cb1-13\" aria-hidden=\"true\"><\/a>            test_groups.append(butina_groups[i])<\/span>\n<span id=\"cb1-14\"><a href=\"#cb1-14\" aria-hidden=\"true\"><\/a>        <span class=\"bu\">print<\/span>(<span class=\"ss\">f'Train groups: <\/span><span class=\"sc\">{<\/span>np<span class=\"sc\">.<\/span>unique(train_groups)<span class=\"sc\">}<\/span><span class=\"ss\">'<\/span>)<\/span>\n<span id=\"cb1-15\"><a href=\"#cb1-15\" aria-hidden=\"true\"><\/a>        <span class=\"bu\">print<\/span>(<span class=\"ss\">f'Test groups: <\/span><span class=\"sc\">{<\/span>np<span class=\"sc\">.<\/span>unique(test_groups)<span class=\"sc\">}<\/span><span class=\"ss\">'<\/span>)<\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-stdout\">\n<pre><code>Butina clusters for first 10 molecules: [125 124   0   0   0   0  29   9   9   9]\nRepetition 0:\nFold 0:\nTrain groups: [124 125]\nTest groups: [ 0  9 29]\nFold 1:\nTrain groups: [ 0  9 29]\nTest groups: [124 125]\nRepetition 1:\nFold 0:\nTrain groups: [  0 124]\nTest groups: [  9  29 125]\nFold 1:\nTrain groups: [  9  29 125]\nTest groups: [  0 124]<\/code><\/pre>\n<\/div>\n<\/div>\n<p>Therefore, we can 
generate many cross-validation splits with dissimilar train and test molecules.<\/p>\n\n\n<\/main>\n\n\n\n<section id=\"training-and-testing\" class=\"level3\">\n<h3 class=\"anchored\" data-anchor-id=\"training-and-testing\">Training and Testing<\/h3>\n<p>Finally, we need to train and test models on each of our molecular representations over our Butina splits. Here, I\u2019m using light gradient boosted machines from lightgbm (<a href=\"https:\/\/lightgbm.readthedocs.io\/en\/latest\/index.html\">docs<\/a> and <a href=\"https:\/\/papers.nips.cc\/paper_files\/paper\/2017\/hash\/6449f44a102fde848669bdd9eb6b76fa-Abstract.html\">paper<\/a>) due to their fast training and inference. As this is a regression task, we can use <code>LGBMRegressor<\/code>.<\/p>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>For recording performance, we can use two different regression metrics: mean absolute error (MAE) and the coefficient of determination (<img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathrm%7BR%7D%5E%7B2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathrm{R}^{2}\" class=\"latex\" \/>).<\/p>\n<\/div>\n\n\n\n<div id=\"3cdf808a-ad56-4b99-95a3-21718c9e52d4\" class=\"cell\" data-scrolled=\"true\" data-execution_count=\"211\">\n<details open=\"\" class=\"code-fold\">\n<summary>Run training and testing<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb3\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb3-1\"><a href=\"#cb3-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># set number of repeats and cross-validation folds<\/span><\/span>\n<span id=\"cb3-2\"><a href=\"#cb3-2\" aria-hidden=\"true\"><\/a>reps <span class=\"op\">=<\/span> <span class=\"dv\">100<\/span><\/span>\n<span id=\"cb3-3\"><a href=\"#cb3-3\" aria-hidden=\"true\"><\/a>num_folds <span class=\"op\">=<\/span> <span class=\"dv\">5<\/span><\/span>\n<span id=\"cb3-4\"><a href=\"#cb3-4\" 
aria-hidden=\"true\"><\/a>total_splits <span class=\"op\">=<\/span> reps<span class=\"op\">*<\/span>num_folds<\/span>\n<span id=\"cb3-5\"><a href=\"#cb3-5\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb3-6\"><a href=\"#cb3-6\" aria-hidden=\"true\"><\/a><span class=\"co\"># output results storage<\/span><\/span>\n<span id=\"cb3-7\"><a href=\"#cb3-7\" aria-hidden=\"true\"><\/a>results <span class=\"op\">=<\/span> {<\/span>\n<span id=\"cb3-8\"><a href=\"#cb3-8\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'ecfps'<\/span>: {<\/span>\n<span id=\"cb3-9\"><a href=\"#cb3-9\" aria-hidden=\"true\"><\/a>        <span class=\"st\">'mae'<\/span>: np.zeros(total_splits),<\/span>\n<span id=\"cb3-10\"><a href=\"#cb3-10\" aria-hidden=\"true\"><\/a>        <span class=\"st\">'r2'<\/span>: np.zeros(total_splits),<\/span>\n<span id=\"cb3-11\"><a href=\"#cb3-11\" aria-hidden=\"true\"><\/a>    },<\/span>\n<span id=\"cb3-12\"><a href=\"#cb3-12\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'func_fps'<\/span>: {<\/span>\n<span id=\"cb3-13\"><a href=\"#cb3-13\" aria-hidden=\"true\"><\/a>        <span class=\"st\">'mae'<\/span>: np.zeros(total_splits),<\/span>\n<span id=\"cb3-14\"><a href=\"#cb3-14\" aria-hidden=\"true\"><\/a>        <span class=\"st\">'r2'<\/span>: np.zeros(total_splits),<\/span>\n<span id=\"cb3-15\"><a href=\"#cb3-15\" aria-hidden=\"true\"><\/a>    },<\/span>\n<span id=\"cb3-16\"><a href=\"#cb3-16\" aria-hidden=\"true\"><\/a>    <span class=\"st\">'pdv'<\/span>: {<\/span>\n<span id=\"cb3-17\"><a href=\"#cb3-17\" aria-hidden=\"true\"><\/a>        <span class=\"st\">'mae'<\/span>: np.zeros(total_splits),<\/span>\n<span id=\"cb3-18\"><a href=\"#cb3-18\" aria-hidden=\"true\"><\/a>        <span class=\"st\">'r2'<\/span>: np.zeros(total_splits),<\/span>\n<span id=\"cb3-19\"><a href=\"#cb3-19\" aria-hidden=\"true\"><\/a>    }<\/span>\n<span id=\"cb3-20\"><a href=\"#cb3-20\" aria-hidden=\"true\"><\/a>}<\/span>\n<span id=\"cb3-21\"><a href=\"#cb3-21\" 
aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb3-22\"><a href=\"#cb3-22\" aria-hidden=\"true\"><\/a>counter <span class=\"op\">=<\/span> <span class=\"dv\">0<\/span><\/span>\n<span id=\"cb3-23\"><a href=\"#cb3-23\" aria-hidden=\"true\"><\/a>pbar <span class=\"op\">=<\/span> tqdm(total<span class=\"op\">=<\/span>total_splits)<\/span>\n<span id=\"cb3-24\"><a href=\"#cb3-24\" aria-hidden=\"true\"><\/a><span class=\"cf\">for<\/span> i <span class=\"kw\">in<\/span> <span class=\"bu\">range<\/span>(reps):<\/span>\n<span id=\"cb3-25\"><a href=\"#cb3-25\" aria-hidden=\"true\"><\/a>    <span class=\"co\"># loop over repetitions<\/span><\/span>\n<span id=\"cb3-26\"><a href=\"#cb3-26\" aria-hidden=\"true\"><\/a>    <span class=\"co\"># initiate splitter with new random state<\/span><\/span>\n<span id=\"cb3-27\"><a href=\"#cb3-27\" aria-hidden=\"true\"><\/a>    splitter <span class=\"op\">=<\/span> GroupKFold(n_splits<span class=\"op\">=<\/span>num_folds, shuffle<span class=\"op\">=<\/span><span class=\"va\">True<\/span>, random_state<span class=\"op\">=<\/span>i)<\/span>\n<span id=\"cb3-28\"><a href=\"#cb3-28\" aria-hidden=\"true\"><\/a>    <span class=\"co\"># split into folds based on butina clusters<\/span><\/span>\n<span id=\"cb3-29\"><a href=\"#cb3-29\" aria-hidden=\"true\"><\/a>    <span class=\"cf\">for<\/span> fold <span class=\"kw\">in<\/span> splitter.split(fps, groups<span class=\"op\">=<\/span>butina_groups):<\/span>\n<span id=\"cb3-30\"><a href=\"#cb3-30\" aria-hidden=\"true\"><\/a>        <span class=\"co\"># loop over folds<\/span><\/span>\n<span id=\"cb3-31\"><a href=\"#cb3-31\" aria-hidden=\"true\"><\/a>        train_idx, test_idx <span class=\"op\">=<\/span> fold<\/span>\n<span id=\"cb3-32\"><a href=\"#cb3-32\" aria-hidden=\"true\"><\/a>        <span class=\"co\"># get train and test y values<\/span><\/span>\n<span id=\"cb3-33\"><a href=\"#cb3-33\" aria-hidden=\"true\"><\/a>        y_train <span class=\"op\">=<\/span> y_true[train_idx]<\/span>\n<span 
id=\"cb3-34\"><a href=\"#cb3-34\" aria-hidden=\"true\"><\/a>        y_test <span class=\"op\">=<\/span> y_true[test_idx]<\/span>\n<span id=\"cb3-35\"><a href=\"#cb3-35\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb3-36\"><a href=\"#cb3-36\" aria-hidden=\"true\"><\/a>        <span class=\"co\"># fit LGBM model and predict using Morgan Fingerprints<\/span><\/span>\n<span id=\"cb3-37\"><a href=\"#cb3-37\" aria-hidden=\"true\"><\/a>        X_train <span class=\"op\">=<\/span> fps[train_idx]<\/span>\n<span id=\"cb3-38\"><a href=\"#cb3-38\" aria-hidden=\"true\"><\/a>        X_test <span class=\"op\">=<\/span> fps[test_idx]<\/span>\n<span id=\"cb3-39\"><a href=\"#cb3-39\" aria-hidden=\"true\"><\/a>        model <span class=\"op\">=<\/span> lgb.LGBMRegressor(verbose<span class=\"op\">=-<\/span><span class=\"dv\">1<\/span>)<\/span>\n<span id=\"cb3-40\"><a href=\"#cb3-40\" aria-hidden=\"true\"><\/a>        model.fit(X_train, y_train)<\/span>\n<span id=\"cb3-41\"><a href=\"#cb3-41\" aria-hidden=\"true\"><\/a>        preds <span class=\"op\">=<\/span> model.predict(X_test)<\/span>\n<span id=\"cb3-42\"><a href=\"#cb3-42\" aria-hidden=\"true\"><\/a>        results[<span class=\"st\">'ecfps'<\/span>][<span class=\"st\">'mae'<\/span>][counter] <span class=\"op\">=<\/span> mean_absolute_error(y_test, preds)<\/span>\n<span id=\"cb3-43\"><a href=\"#cb3-43\" aria-hidden=\"true\"><\/a>        results[<span class=\"st\">'ecfps'<\/span>][<span class=\"st\">'r2'<\/span>][counter] <span class=\"op\">=<\/span> r2_score(y_test, preds)<\/span>\n<span id=\"cb3-44\"><a href=\"#cb3-44\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb3-45\"><a href=\"#cb3-45\" aria-hidden=\"true\"><\/a>        <span class=\"co\"># fit LGBM model and predict using functional Morgan Fingerprints<\/span><\/span>\n<span id=\"cb3-46\"><a href=\"#cb3-46\" aria-hidden=\"true\"><\/a>        X_train <span class=\"op\">=<\/span> func_fps[train_idx]<\/span>\n<span id=\"cb3-47\"><a href=\"#cb3-47\" 
aria-hidden=\"true\"><\/a>        X_test <span class=\"op\">=<\/span> func_fps[test_idx]<\/span>\n<span id=\"cb3-48\"><a href=\"#cb3-48\" aria-hidden=\"true\"><\/a>        model <span class=\"op\">=<\/span> lgb.LGBMRegressor(verbose<span class=\"op\">=-<\/span><span class=\"dv\">1<\/span>)<\/span>\n<span id=\"cb3-49\"><a href=\"#cb3-49\" aria-hidden=\"true\"><\/a>        model.fit(X_train, y_train)<\/span>\n<span id=\"cb3-50\"><a href=\"#cb3-50\" aria-hidden=\"true\"><\/a>        preds <span class=\"op\">=<\/span> model.predict(X_test)<\/span>\n<span id=\"cb3-51\"><a href=\"#cb3-51\" aria-hidden=\"true\"><\/a>        results[<span class=\"st\">'func_fps'<\/span>][<span class=\"st\">'mae'<\/span>][counter] <span class=\"op\">=<\/span> mean_absolute_error(y_test, preds)<\/span>\n<span id=\"cb3-52\"><a href=\"#cb3-52\" aria-hidden=\"true\"><\/a>        results[<span class=\"st\">'func_fps'<\/span>][<span class=\"st\">'r2'<\/span>][counter] <span class=\"op\">=<\/span> r2_score(y_test, preds)<\/span>\n<span id=\"cb3-53\"><a href=\"#cb3-53\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb3-54\"><a href=\"#cb3-54\" aria-hidden=\"true\"><\/a>        <span class=\"co\"># fit LGBM model and predict using physicochemical descriptor vectors<\/span><\/span>\n<span id=\"cb3-55\"><a href=\"#cb3-55\" aria-hidden=\"true\"><\/a>        X_train <span class=\"op\">=<\/span> pdvs[train_idx]<\/span>\n<span id=\"cb3-56\"><a href=\"#cb3-56\" aria-hidden=\"true\"><\/a>        X_test <span class=\"op\">=<\/span> pdvs[test_idx]<\/span>\n<span id=\"cb3-57\"><a href=\"#cb3-57\" aria-hidden=\"true\"><\/a>        model <span class=\"op\">=<\/span> lgb.LGBMRegressor(verbose<span class=\"op\">=-<\/span><span class=\"dv\">1<\/span>)<\/span>\n<span id=\"cb3-58\"><a href=\"#cb3-58\" aria-hidden=\"true\"><\/a>        model.fit(X_train, y_train)<\/span>\n<span id=\"cb3-59\"><a href=\"#cb3-59\" aria-hidden=\"true\"><\/a>        preds <span class=\"op\">=<\/span> model.predict(X_test)<\/span>\n<span 
id=\"cb3-60\"><a href=\"#cb3-60\" aria-hidden=\"true\"><\/a>        results[<span class=\"st\">'pdv'<\/span>][<span class=\"st\">'mae'<\/span>][counter] <span class=\"op\">=<\/span> mean_absolute_error(y_test, preds)<\/span>\n<span id=\"cb3-61\"><a href=\"#cb3-61\" aria-hidden=\"true\"><\/a>        results[<span class=\"st\">'pdv'<\/span>][<span class=\"st\">'r2'<\/span>][counter] <span class=\"op\">=<\/span> r2_score(y_test, preds)<\/span>\n<span id=\"cb3-62\"><a href=\"#cb3-62\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb3-63\"><a href=\"#cb3-63\" aria-hidden=\"true\"><\/a>        counter <span class=\"op\">+=<\/span> <span class=\"dv\">1<\/span><\/span>\n<span id=\"cb3-64\"><a href=\"#cb3-64\" aria-hidden=\"true\"><\/a>        pbar.update(<span class=\"dv\">1<\/span>)<\/span>\n<span id=\"cb3-65\"><a href=\"#cb3-65\" aria-hidden=\"true\"><\/a>pbar.close()<\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-stderr\">\n<pre><code>100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 500\/500 [02:58&lt;00:00,  2.81it\/s]<\/code><\/pre>\n<\/div>\n<\/div>\n<\/section>\n<section id=\"plotting-metrics\" class=\"level3\">\n<h3 class=\"anchored\" data-anchor-id=\"plotting-metrics\">Plotting Metrics<\/h3>\n<p>Now that we have multiple performance metrics for each featurisation approach, rather than using a table, we can produce a scatter plot with a metric on each axis. 
Plots visualise differences between models and highlight instances where one metric might provide a misleading representation of performance.<\/p>\n<div id=\"3fa43f0e-c139-41b6-84df-5ea441dc1861\" class=\"cell\" data-scrolled=\"true\">\n<details open=\"\" class=\"code-fold\">\n<summary>Scatter plot<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb5\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb5-1\"><a href=\"#cb5-1\" aria-hidden=\"true\"><\/a>plt.figure(figsize<span class=\"op\">=<\/span>(<span class=\"dv\">10<\/span>, <span class=\"dv\">6<\/span>))<\/span>\n<span id=\"cb5-2\"><a href=\"#cb5-2\" aria-hidden=\"true\"><\/a>labels <span class=\"op\">=<\/span> {<span class=\"st\">'ecfps'<\/span>: <span class=\"st\">'ECFP$4_<\/span><span class=\"sc\">{1024}<\/span><span class=\"st\">$'<\/span>, <span class=\"st\">'pdv'<\/span>: <span class=\"st\">'PDV'<\/span>, <span class=\"st\">'func_fps'<\/span>: <span class=\"st\">'FCFP$4_<\/span><span class=\"sc\">{1024}<\/span><span class=\"st\">$'<\/span>}<\/span>\n<span id=\"cb5-3\"><a href=\"#cb5-3\" aria-hidden=\"true\"><\/a><span class=\"cf\">for<\/span> method, scores <span class=\"kw\">in<\/span> results.items():<\/span>\n<span id=\"cb5-4\"><a href=\"#cb5-4\" aria-hidden=\"true\"><\/a>    x <span class=\"op\">=<\/span> scores[<span class=\"st\">'mae'<\/span>].mean()<\/span>\n<span id=\"cb5-5\"><a href=\"#cb5-5\" aria-hidden=\"true\"><\/a>    <span class=\"co\"># mae confidence intervals<\/span><\/span>\n<span id=\"cb5-6\"><a href=\"#cb5-6\" aria-hidden=\"true\"><\/a>    x_ci <span class=\"op\">=<\/span> stats.t.interval(<\/span>\n<span id=\"cb5-7\"><a href=\"#cb5-7\" aria-hidden=\"true\"><\/a>        <span class=\"fl\">0.95<\/span>,<\/span>\n<span id=\"cb5-8\"><a href=\"#cb5-8\" aria-hidden=\"true\"><\/a>        total_splits<span class=\"op\">-<\/span><span class=\"dv\">1<\/span>,<\/span>\n<span id=\"cb5-9\"><a href=\"#cb5-9\" aria-hidden=\"true\"><\/a>        loc<span 
class=\"op\">=<\/span>x,<\/span>\n<span id=\"cb5-10\"><a href=\"#cb5-10\" aria-hidden=\"true\"><\/a>        scale<span class=\"op\">=<\/span>stats.sem(scores[<span class=\"st\">'mae'<\/span>])<\/span>\n<span id=\"cb5-11\"><a href=\"#cb5-11\" aria-hidden=\"true\"><\/a>    )<\/span>\n<span id=\"cb5-12\"><a href=\"#cb5-12\" aria-hidden=\"true\"><\/a>    y <span class=\"op\">=<\/span> scores[<span class=\"st\">'r2'<\/span>].mean()<\/span>\n<span id=\"cb5-13\"><a href=\"#cb5-13\" aria-hidden=\"true\"><\/a>    <span class=\"co\"># r2 confidence intervals<\/span><\/span>\n<span id=\"cb5-14\"><a href=\"#cb5-14\" aria-hidden=\"true\"><\/a>    y_ci <span class=\"op\">=<\/span> stats.t.interval(<\/span>\n<span id=\"cb5-15\"><a href=\"#cb5-15\" aria-hidden=\"true\"><\/a>        <span class=\"fl\">0.95<\/span>,<\/span>\n<span id=\"cb5-16\"><a href=\"#cb5-16\" aria-hidden=\"true\"><\/a>        total_splits<span class=\"op\">-<\/span><span class=\"dv\">1<\/span>,<\/span>\n<span id=\"cb5-17\"><a href=\"#cb5-17\" aria-hidden=\"true\"><\/a>        loc<span class=\"op\">=<\/span>y,<\/span>\n<span id=\"cb5-18\"><a href=\"#cb5-18\" aria-hidden=\"true\"><\/a>        scale<span class=\"op\">=<\/span>stats.sem(scores[<span class=\"st\">'r2'<\/span>])<\/span>\n<span id=\"cb5-19\"><a href=\"#cb5-19\" aria-hidden=\"true\"><\/a>    )<\/span>\n<span id=\"cb5-20\"><a href=\"#cb5-20\" aria-hidden=\"true\"><\/a>    <span class=\"co\"># plot means<\/span><\/span>\n<span id=\"cb5-21\"><a href=\"#cb5-21\" aria-hidden=\"true\"><\/a>    plt.scatter(x, y, alpha<span class=\"op\">=<\/span><span class=\"fl\">0.5<\/span>, label<span class=\"op\">=<\/span>labels[method])<\/span>\n<span id=\"cb5-22\"><a href=\"#cb5-22\" aria-hidden=\"true\"><\/a>    <span class=\"co\"># plot confidence intervals<\/span><\/span>\n<span id=\"cb5-23\"><a href=\"#cb5-23\" aria-hidden=\"true\"><\/a>    plt.errorbar(x, y, alpha<span class=\"op\">=<\/span><span class=\"fl\">0.5<\/span>, xerr<span 
class=\"op\">=<\/span>(x_ci[<span class=\"dv\">1<\/span>]<span class=\"op\">-<\/span>x_ci[<span class=\"dv\">0<\/span>])<span class=\"op\">\/<\/span><span class=\"dv\">2<\/span>, yerr<span class=\"op\">=<\/span>(y_ci[<span class=\"dv\">1<\/span>]<span class=\"op\">-<\/span>y_ci[<span class=\"dv\">0<\/span>])<span class=\"op\">\/<\/span><span class=\"dv\">2<\/span>, capsize<span class=\"op\">=<\/span><span class=\"dv\">5<\/span>)<\/span>\n<span id=\"cb5-24\"><a href=\"#cb5-24\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb5-25\"><a href=\"#cb5-25\" aria-hidden=\"true\"><\/a>plt.title(<span class=\"st\">'Coefficient of determination versus mean absolute error on hPPB task'<\/span>)<\/span>\n<span id=\"cb5-26\"><a href=\"#cb5-26\" aria-hidden=\"true\"><\/a>plt.xlabel(<span class=\"st\">'MAE<\/span><span class=\"ch\">\\u2193<\/span><span class=\"st\"> [log$_<\/span><span class=\"sc\">{10}<\/span><span class=\"st\">$(%)]'<\/span>)<\/span>\n<span id=\"cb5-27\"><a href=\"#cb5-27\" aria-hidden=\"true\"><\/a>plt.ylabel(<span class=\"st\">'R$^<\/span><span class=\"sc\">{2}<\/span><span class=\"ch\">\\u2191<\/span><span class=\"st\">$'<\/span>)<\/span>\n<span id=\"cb5-28\"><a href=\"#cb5-28\" aria-hidden=\"true\"><\/a>plt.legend()<\/span>\n<span id=\"cb5-29\"><a href=\"#cb5-29\" aria-hidden=\"true\"><\/a>plt.grid(<span class=\"va\">True<\/span>)<\/span>\n<span id=\"cb5-30\"><a href=\"#cb5-30\" aria-hidden=\"true\"><\/a>plt.show()<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n<\/section>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/scatter.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"375\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/scatter.png?resize=625%2C375&#038;ssl=1\" alt=\"\" class=\"wp-image-12067\" 
srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/scatter.png?w=1000&amp;ssl=1 1000w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/scatter.png?resize=300%2C180&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/scatter.png?resize=768%2C461&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/scatter.png?resize=624%2C374&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"figure\">\n<figcaption>Coefficient of determination versus mean absolute error for ECFP4 (blue), FCFP4 (yellow), and PDVs (green) on the human plasma protein binding (log(%)) task. Values shown are the mean of each metric over 100 repeats of 5-fold cross-validation with GroupKFold and Butina clustering at a threshold of 0.65. Error bars are 95% confidence intervals.<\/figcaption>\n<\/figure><br>\n\n\n\n<main class=\"content\" id=\"quarto-document-content\">\n\n\n\n\n<p>The legibility of a results section is an aspect of research that is often overlooked. Simple plots, such as the one above, help to communicate differences between approaches. Aggregated tables with results for many models on multiple benchmarks are much harder to read by comparison.<\/p>\n<section id=\"statistical-tests\" class=\"level3\">\n<h3 class=\"anchored\" data-anchor-id=\"statistical-tests\">Statistical Tests<\/h3>\n<p>So, we\u2019ve visualised our results, but are the approaches significantly different in performance? We can use <em>t<\/em>-tests, Tukey\u2019s HSD tests, and analysis of variance (ANOVA) to evaluate performance differences between methods.<\/p>\n<section id=\"t-tests\" class=\"level4\">\n<h4 class=\"anchored\" data-anchor-id=\"t-tests\">t-tests<\/h4>\n<p>A <em>t<\/em>-test is a method for testing whether there is a significant difference between the means of two groups of data. 
Here, we use <a href=\"https:\/\/en.wikipedia.org\/wiki\/Student%27s_t-test#Dependent_t-test_for_paired_samples\">paired <em>t<\/em>-tests<\/a>, as test splits were the same for each model. Our null hypothesis is that there is no difference in metrics between LGBMs with PDVs and ECFPs. A <em>t<\/em>-test can be run in Python using the <code>stats<\/code> module in the <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.ttest_rel.html\">scipy<\/a> library:<\/p>\n<div id=\"58099bc2-9be4-4adf-995d-8e4f031798e1\" class=\"cell\" data-execution_count=\"123\">\n<details open=\"\" class=\"code-fold\">\n<summary>MAE t-test<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb1\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb1-1\"><a href=\"#cb1-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># MAE t-test between physicochemical descriptor vectors and extended-connectivity fingerprints<\/span><\/span>\n<span id=\"cb1-2\"><a href=\"#cb1-2\" aria-hidden=\"true\"><\/a>mae_ttest <span class=\"op\">=<\/span> stats.ttest_rel(results[<span class=\"st\">'pdv'<\/span>][<span class=\"st\">'mae'<\/span>], results[<span class=\"st\">'ecfps'<\/span>][<span class=\"st\">'mae'<\/span>])<\/span>\n<span id=\"cb1-3\"><a href=\"#cb1-3\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb1-4\"><a href=\"#cb1-4\" aria-hidden=\"true\"><\/a><span class=\"ss\">f't-statistic = <\/span><span class=\"sc\">{<\/span>mae_ttest[<span class=\"dv\">0<\/span>]<span class=\"sc\">:.3}<\/span><span class=\"ss\">; p-value = <\/span><span class=\"sc\">{<\/span>mae_ttest[<span class=\"dv\">1<\/span>]<span class=\"sc\">:.3}<\/span><span class=\"ss\">; degrees of <\/span><span class=\"er\"><\/span><span class=\"ss\">freedom = <\/span><span class=\"sc\">{mae_ttest.df}<\/span><span class=\"st\">'<\/span><\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-display\" 
data-execution_count=\"123\">\n<pre><code>'t-statistic = -28.5; p-value = 1.16e-106; degrees of freedom = 499'<\/code><\/pre>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>How do we interpret this result? As the t-statistic is <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%3C+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&lt; 0\" class=\"latex\" \/>, the mean MAE value for PDVs is lower than that of ECFPs. The p-value is the probability of observing a t-statistic at least as extreme as this one, given that the null hypothesis (the hypothesis that there\u2019s no difference between the MAE for PDVs and ECFPs) is true. The typical p-value threshold for rejecting the null hypothesis is <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=0.05&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"0.05\" class=\"latex\" \/>. For the <em>t<\/em>-test above, the difference between the MAE for PDVs and ECFPs is significant, as the p-value is <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%3C+0.05&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&lt; 0.05\" class=\"latex\" \/>. 
The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Degrees_of_freedom_(statistics)\">degrees of freedom<\/a> for a <em>t<\/em>-test is the number of samples minus one; the number of samples here is the number of train-test splits used.<\/p>\n<\/div>\n\n\n\n<div id=\"13ae546b-512c-40c1-a838-c7d12c85925f\" class=\"cell\" data-execution_count=\"124\">\n<details open=\"\" class=\"code-fold\">\n<summary>R2 t-test<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb3\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb3-1\"><a href=\"#cb3-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># R2 t-test between physicochemical descriptor vectors and extended-connectivity fingerprints<\/span><\/span>\n<span id=\"cb3-2\"><a href=\"#cb3-2\" aria-hidden=\"true\"><\/a>r2_ttest <span class=\"op\">=<\/span> stats.ttest_rel(results[<span class=\"st\">'pdv'<\/span>][<span class=\"st\">'r2'<\/span>], results[<span class=\"st\">'ecfps'<\/span>][<span class=\"st\">'r2'<\/span>])<\/span>\n<span id=\"cb3-3\"><a href=\"#cb3-3\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb3-4\"><a href=\"#cb3-4\" aria-hidden=\"true\"><\/a><span class=\"ss\">f't-statistic = <\/span><span class=\"sc\">{<\/span>r2_ttest[<span class=\"dv\">0<\/span>]<span class=\"sc\">:.3}<\/span><span class=\"ss\">; p-value = <\/span><span class=\"sc\">{<\/span>r2_ttest[<span class=\"dv\">1<\/span>]<span class=\"sc\">:.3}<\/span><span class=\"ss\">; degrees of freedom = <\/span><span class=\"sc\">{<\/span>r2_ttest<span class=\"sc\">.<\/span>df<span class=\"sc\">}<\/span><span class=\"ss\">'<\/span><\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-display\" data-execution_count=\"124\">\n<pre><code>'t-statistic = 22.9; p-value = 5.83e-80; degrees of freedom = 499'<\/code><\/pre>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>For <img decoding=\"async\" loading=\"lazy\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathrm%7BR%7D%5E%7B2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathrm{R}^{2}\" class=\"latex\" \/>, the t-statistic is <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%3E+0&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&gt; 0\" class=\"latex\" \/>, so the mean <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%5Cmathrm%7BR%7D%5E%7B2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&#92;mathrm{R}^{2}\" class=\"latex\" \/> value for PDVs is greater than that of ECFPs; again, this difference is significant, as the p-value is <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=%3C+0.05&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"&lt; 0.05\" class=\"latex\" \/>.<\/p>\n<\/div>\n\n\n\n<\/section>\n<section id=\"tukey-hsd\" class=\"level4\">\n<h4 class=\"anchored\" data-anchor-id=\"tukey-hsd\">Tukey HSD<\/h4>\n<p>So, a paired <em>t<\/em>-test is well suited to comparing two models. To handle three or more models, we can compare methods in a pairwise manner using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Tukey's_range_test#The_test_statistic\">Tukey\u2019s HSD test<\/a>, which is another method for evaluating differences between means. 
This can be done with <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.tukey_hsd.html\">scipy<\/a>:<\/p>\n<div id=\"d251c4f3-fefd-4d97-9549-df61286beaa5\" class=\"cell\" data-execution_count=\"205\">\n<details open=\"\" class=\"code-fold\">\n<summary>MAE Tukey HSD<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb5\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb5-1\"><a href=\"#cb5-1\" aria-hidden=\"true\"><\/a><span class=\"co\"># MAE Tukey HSD<\/span><\/span>\n<span id=\"cb5-2\"><a href=\"#cb5-2\" aria-hidden=\"true\"><\/a>cols <span class=\"op\">=<\/span> [<span class=\"st\">'PDV'<\/span>, <span class=\"st\">'ECFP4'<\/span>, <span class=\"st\">'FCFP4'<\/span>]<\/span>\n<span id=\"cb5-3\"><a href=\"#cb5-3\" aria-hidden=\"true\"><\/a>indexes <span class=\"op\">=<\/span> cols<\/span>\n<span id=\"cb5-4\"><a href=\"#cb5-4\" aria-hidden=\"true\"><\/a>cols <span class=\"op\">=<\/span> pd.MultiIndex.from_product([cols, [<span class=\"st\">'statistic'<\/span>, <span class=\"st\">'p-value'<\/span>]])<\/span>\n<span id=\"cb5-5\"><a href=\"#cb5-5\" aria-hidden=\"true\"><\/a>mae_tukey <span class=\"op\">=<\/span> stats.tukey_hsd(results[<span class=\"st\">'pdv'<\/span>][<span class=\"st\">'mae'<\/span>], results[<span class=\"st\">'ecfps'<\/span>][<span class=\"st\">'mae'<\/span>], results[<span class=\"st\">'func_fps'<\/span>][<span class=\"st\">'mae'<\/span>])<\/span>\n<span id=\"cb5-6\"><a href=\"#cb5-6\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb5-7\"><a href=\"#cb5-7\" aria-hidden=\"true\"><\/a>tukey_mae_stats <span class=\"op\">=<\/span> pd.DataFrame(columns<span class=\"op\">=<\/span>cols, index<span class=\"op\">=<\/span>indexes)<\/span>\n<span id=\"cb5-8\"><a href=\"#cb5-8\" aria-hidden=\"true\"><\/a>tukey_mae_stats.loc[:, pd.IndexSlice[:, <span class=\"st\">'statistic'<\/span>]] <span class=\"op\">=<\/span> mae_tukey.statistic<\/span>\n<span id=\"cb5-9\"><a 
href=\"#cb5-9\" aria-hidden=\"true\"><\/a>tukey_mae_stats.loc[:, pd.IndexSlice[:, <span class=\"st\">'p-value'<\/span>]] <span class=\"op\">=<\/span> mae_tukey.pvalue<\/span>\n<span id=\"cb5-10\"><a href=\"#cb5-10\" aria-hidden=\"true\"><\/a>tukey_mae_stats<\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-display\" data-execution_count=\"205\">\n<div>\n\n\n<table class=\"dataframe caption-top table table-sm table-striped small\" data-quarto-postprocess=\"true\" data-border=\"1\">\n<thead>\n<tr class=\"header\">\n<th data-quarto-table-cell-role=\"th\"><\/th>\n<th colspan=\"2\" data-quarto-table-cell-role=\"th\" data-halign=\"left\">PDV<\/th>\n<th colspan=\"2\" data-quarto-table-cell-role=\"th\" data-halign=\"left\">ECFP4<\/th>\n<th colspan=\"2\" data-quarto-table-cell-role=\"th\" data-halign=\"left\">FCFP4<\/th>\n<\/tr>\n<tr class=\"odd\">\n<th data-quarto-table-cell-role=\"th\"><\/th>\n<th data-quarto-table-cell-role=\"th\">statistic<\/th>\n<th data-quarto-table-cell-role=\"th\">p-value<\/th>\n<th data-quarto-table-cell-role=\"th\">statistic<\/th>\n<th data-quarto-table-cell-role=\"th\">p-value<\/th>\n<th data-quarto-table-cell-role=\"th\">statistic<\/th>\n<th data-quarto-table-cell-role=\"th\">p-value<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"odd\">\n<td data-quarto-table-cell-role=\"th\">PDV<\/td>\n<td>0.0<\/td>\n<td>1.0<\/td>\n<td>-0.103552<\/td>\n<td>0.0<\/td>\n<td>-0.07745<\/td>\n<td>0.0<\/td>\n<\/tr>\n<tr class=\"even\">\n<td data-quarto-table-cell-role=\"th\">ECFP4<\/td>\n<td>0.103552<\/td>\n<td>0.0<\/td>\n<td>0.0<\/td>\n<td>1.0<\/td>\n<td>0.026102<\/td>\n<td>0.0<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td data-quarto-table-cell-role=\"th\">FCFP4<\/td>\n<td>0.07745<\/td>\n<td>0.0<\/td>\n<td>-0.026102<\/td>\n<td>0.0<\/td>\n<td>0.0<\/td>\n<td>1.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-jetpack-markdown\"><p>Note that for these functions (<em>t<\/em>-tests, Tukey HSD, etc.) 
there are instances where the output p-value is 0; this is likely due to precision limits for float values. When this occurs, I suggest reporting the p-value as p-value &lt; 0.001.<\/p>\n<\/div>\n\n\n\n<p>To visualise this kind of pairwise comparison, <a href=\"https:\/\/chemrxiv.org\/engage\/chemrxiv\/article-details\/672a91bd7be152b1d01a926b\">Ash <em>et al.<\/em> (2024)<\/a> suggest using Multiple Comparisons Similarity (MCSim) plots (i.e., heatmaps):<\/p>\n<div id=\"853a0c0d-70b2-4ba8-a914-d7824008df25\" class=\"cell\" data-scrolled=\"true\">\n<details open=\"\" class=\"code-fold\">\n<summary>MCSim plot<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb6\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb6-1\"><a href=\"#cb6-1\" aria-hidden=\"true\"><\/a>plt.figure(figsize<span class=\"op\">=<\/span>(<span class=\"dv\">6<\/span>, <span class=\"dv\">5<\/span>))<\/span>\n<span id=\"cb6-2\"><a href=\"#cb6-2\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb6-3\"><a href=\"#cb6-3\" aria-hidden=\"true\"><\/a>sns.heatmap(<\/span>\n<span id=\"cb6-4\"><a href=\"#cb6-4\" aria-hidden=\"true\"><\/a>    mae_tukey.statistic, <span class=\"co\"># values to plot<\/span><\/span>\n<span id=\"cb6-5\"><a href=\"#cb6-5\" aria-hidden=\"true\"><\/a>    annot<span class=\"op\">=<\/span><span class=\"va\">True<\/span>,<\/span>\n<span id=\"cb6-6\"><a href=\"#cb6-6\" aria-hidden=\"true\"><\/a>    cmap<span class=\"op\">=<\/span><span class=\"st\">'coolwarm'<\/span>,<\/span>\n<span id=\"cb6-7\"><a href=\"#cb6-7\" aria-hidden=\"true\"><\/a>    center<span class=\"op\">=<\/span><span class=\"dv\">0<\/span>,<\/span>\n<span id=\"cb6-8\"><a href=\"#cb6-8\" aria-hidden=\"true\"><\/a>    fmt<span class=\"op\">=<\/span><span class=\"st\">\".2f\"<\/span>,<\/span>\n<span id=\"cb6-9\"><a href=\"#cb6-9\" aria-hidden=\"true\"><\/a>    cbar_kws<span class=\"op\">=<\/span>{<span class=\"st\">'label'<\/span>: <span 
class=\"st\">'Effect size'<\/span>},<\/span>\n<span id=\"cb6-10\"><a href=\"#cb6-10\" aria-hidden=\"true\"><\/a>    xticklabels<span class=\"op\">=<\/span>indexes, yticklabels<span class=\"op\">=<\/span>indexes, <span class=\"co\"># label rows and cols with models<\/span><\/span>\n<span id=\"cb6-11\"><a href=\"#cb6-11\" aria-hidden=\"true\"><\/a>    vmin<span class=\"op\">=-<\/span><span class=\"fl\">0.2<\/span>, vmax<span class=\"op\">=<\/span><span class=\"fl\">0.2<\/span>, <span class=\"co\"># colour bar range<\/span><\/span>\n<span id=\"cb6-12\"><a href=\"#cb6-12\" aria-hidden=\"true\"><\/a>)<\/span>\n<span id=\"cb6-13\"><a href=\"#cb6-13\" aria-hidden=\"true\"><\/a>plt.show()<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n<\/section>\n<\/section>\n\n<\/main>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/mcsim.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"600\" height=\"500\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/mcsim.png?resize=600%2C500&#038;ssl=1\" alt=\"\" class=\"wp-image-12068\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/mcsim.png?w=600&amp;ssl=1 600w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/mcsim.png?resize=300%2C250&amp;ssl=1 300w\" sizes=\"auto, (max-width: 600px) 100vw, 600px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"figure\">\n<figcaption>MCSim plot of MAE effect size between featurisation approaches. Each row and each column corresponds to a featurisation method. A larger absolute effect size indicates a greater difference between methods.<\/figcaption>\n<\/figure><br>\n\n\n\n<title>ANOVA<\/title>\n\n<p>Analysis of variance (<a href=\"https:\/\/en.wikipedia.org\/wiki\/Analysis_of_variance\">ANOVA<\/a>) is a method for testing the difference in means between groups of data. 
Like <em>t<\/em>-tests and Tukey HSD, <a href=\"https:\/\/en.wikipedia.org\/wiki\/One-way_analysis_of_variance\">one-way ANOVA<\/a> can be performed using the <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.stats.mstats.f_oneway.html\">scipy<\/a> library in Python:<\/p>\n<div id=\"7e0eeb8a-f8e7-4639-8bfc-bd361f82ddae\" class=\"cell\" data-execution_count=\"314\">\n<details open=\"\" class=\"code-fold\">\n<summary>One-way ANOVA<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb1\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb1-1\"><a href=\"#cb1-1\" aria-hidden=\"true\"><\/a>cols <span class=\"op\">=<\/span> pd.MultiIndex.from_product([indexes, [<span class=\"st\">'F-statistic'<\/span>, <span class=\"st\">'p-value'<\/span>]])<\/span>\n<span id=\"cb1-2\"><a href=\"#cb1-2\" aria-hidden=\"true\"><\/a>anova_df <span class=\"op\">=<\/span> pd.DataFrame(columns<span class=\"op\">=<\/span>cols, index<span class=\"op\">=<\/span>indexes)<\/span>\n<span id=\"cb1-3\"><a href=\"#cb1-3\" aria-hidden=\"true\"><\/a>name_map <span class=\"op\">=<\/span> {<span class=\"st\">'pdv'<\/span>:<span class=\"st\">'PDV'<\/span>, <span class=\"st\">'ecfps'<\/span>:<span class=\"st\">'ECFP4'<\/span>, <span class=\"st\">'func_fps'<\/span>: <span class=\"st\">'FCFP4'<\/span>}<\/span>\n<span id=\"cb1-4\"><a href=\"#cb1-4\" aria-hidden=\"true\"><\/a><span class=\"cf\">for<\/span> i <span class=\"kw\">in<\/span> results:<\/span>\n<span id=\"cb1-5\"><a href=\"#cb1-5\" aria-hidden=\"true\"><\/a>    <span class=\"cf\">for<\/span> j <span class=\"kw\">in<\/span> results:<\/span>\n<span id=\"cb1-6\"><a href=\"#cb1-6\" aria-hidden=\"true\"><\/a>        anova <span class=\"op\">=<\/span> stats.f_oneway(results[i][<span class=\"st\">'r2'<\/span>], results[j][<span class=\"st\">'r2'<\/span>])<\/span>\n<span id=\"cb1-7\"><a href=\"#cb1-7\" aria-hidden=\"true\"><\/a>        anova_df.loc[name_map[j], (name_map[i], <span 
class=\"st\">'F-statistic'<\/span>)] <span class=\"op\">=<\/span> anova.statistic<\/span>\n<span id=\"cb1-8\"><a href=\"#cb1-8\" aria-hidden=\"true\"><\/a>        anova_df.loc[name_map[j], (name_map[i], <span class=\"st\">'p-value'<\/span>)] <span class=\"op\">=<\/span> anova.pvalue<\/span>\n<span id=\"cb1-9\"><a href=\"#cb1-9\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb1-10\"><a href=\"#cb1-10\" aria-hidden=\"true\"><\/a>anova_df<\/span><\/code><\/pre><\/div>\n<\/details>\n<div class=\"cell-output cell-output-display\" data-execution_count=\"314\">\n<div>\n\n\n<table class=\"dataframe caption-top table table-sm table-striped small\" data-quarto-postprocess=\"true\" data-border=\"1\">\n<thead>\n<tr class=\"header\">\n<th data-quarto-table-cell-role=\"th\"><\/th>\n<th colspan=\"2\" data-quarto-table-cell-role=\"th\" data-halign=\"left\">PDV<\/th>\n<th colspan=\"2\" data-quarto-table-cell-role=\"th\" data-halign=\"left\">ECFP4<\/th>\n<th colspan=\"2\" data-quarto-table-cell-role=\"th\" data-halign=\"left\">FCFP4<\/th>\n<\/tr>\n<tr class=\"odd\">\n<th data-quarto-table-cell-role=\"th\"><\/th>\n<th data-quarto-table-cell-role=\"th\">F-statistic<\/th>\n<th data-quarto-table-cell-role=\"th\">p-value<\/th>\n<th data-quarto-table-cell-role=\"th\">F-statistic<\/th>\n<th data-quarto-table-cell-role=\"th\">p-value<\/th>\n<th data-quarto-table-cell-role=\"th\">F-statistic<\/th>\n<th data-quarto-table-cell-role=\"th\">p-value<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr class=\"odd\">\n<td data-quarto-table-cell-role=\"th\">PDV<\/td>\n<td>0.0<\/td>\n<td>1.0<\/td>\n<td>278.151946<\/td>\n<td>0.0<\/td>\n<td>174.92284<\/td>\n<td>0.0<\/td>\n<\/tr>\n<tr class=\"even\">\n<td data-quarto-table-cell-role=\"th\">ECFP4<\/td>\n<td>278.151946<\/td>\n<td>0.0<\/td>\n<td>0.0<\/td>\n<td>1.0<\/td>\n<td>6.762882<\/td>\n<td>0.009445<\/td>\n<\/tr>\n<tr class=\"odd\">\n<td 
data-quarto-table-cell-role=\"th\">FCFP4<\/td>\n<td>174.92284<\/td>\n<td>0.0<\/td>\n<td>6.762882<\/td>\n<td>0.009445<\/td>\n<td>0.0<\/td>\n<td>1.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n\n<\/div>\n<\/div>\n<\/div>\n<p>Now we have another indicator of the difference between means, the F-statistic, and p-values to determine whether the differences are significant. Let\u2019s heatmap the p-values of these results:<\/p>\n<div id=\"292a9ac5-a8c4-42b6-b3c8-e0e15a56afc3\" class=\"cell\" data-scrolled=\"true\">\n<details open=\"\" class=\"code-fold\">\n<summary>Heatmap of p-values<\/summary>\n<div class=\"sourceCode cell-code\" id=\"cb2\"><pre class=\"sourceCode python code-with-copy\"><code class=\"sourceCode python\"><span id=\"cb2-1\"><a href=\"#cb2-1\" aria-hidden=\"true\"><\/a>pvals <span class=\"op\">=<\/span> anova_df.loc[:, pd.IndexSlice[:, <span class=\"st\">'p-value'<\/span>]].astype(<span class=\"bu\">float<\/span>).to_numpy()<\/span>\n<span id=\"cb2-2\"><a href=\"#cb2-2\" aria-hidden=\"true\"><\/a>arr <span class=\"op\">=<\/span> np.zeros(pvals.shape, dtype<span class=\"op\">=<\/span><span class=\"bu\">int<\/span>)<\/span>\n<span id=\"cb2-3\"><a href=\"#cb2-3\" aria-hidden=\"true\"><\/a>arr[np.where(pvals <span class=\"op\">&lt;<\/span> <span class=\"fl\">0.05<\/span>)] <span class=\"op\">+=<\/span><span class=\"dv\">1<\/span><\/span>\n<span id=\"cb2-4\"><a href=\"#cb2-4\" aria-hidden=\"true\"><\/a>arr[np.where(pvals <span class=\"op\">&lt;<\/span> <span class=\"fl\">0.01<\/span>)] <span class=\"op\">+=<\/span><span class=\"dv\">1<\/span><\/span>\n<span id=\"cb2-5\"><a href=\"#cb2-5\" aria-hidden=\"true\"><\/a>arr[np.where(pvals <span class=\"op\">&lt;<\/span> <span class=\"fl\">0.001<\/span>)] <span class=\"op\">+=<\/span><span class=\"dv\">1<\/span><\/span>\n<span id=\"cb2-6\"><a href=\"#cb2-6\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb2-7\"><a href=\"#cb2-7\" aria-hidden=\"true\"><\/a>plt.figure(figsize<span class=\"op\">=<\/span>(<span 
class=\"dv\">6<\/span>, <span class=\"dv\">5<\/span>))<\/span>\n<span id=\"cb2-8\"><a href=\"#cb2-8\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb2-9\"><a href=\"#cb2-9\" aria-hidden=\"true\"><\/a>colors <span class=\"op\">=<\/span> [<span class=\"st\">\"#d3d3d3\"<\/span>, <span class=\"st\">\"#90ee90\"<\/span>, <span class=\"st\">\"#008000\"<\/span>, <span class=\"st\">\"#004d00\"<\/span>]  <span class=\"co\"># Custom colors for 0, 1, 2, 3<\/span><\/span>\n<span id=\"cb2-10\"><a href=\"#cb2-10\" aria-hidden=\"true\"><\/a>cmap <span class=\"op\">=<\/span> ListedColormap(colors)<\/span>\n<span id=\"cb2-11\"><a href=\"#cb2-11\" aria-hidden=\"true\"><\/a><\/span>\n<span id=\"cb2-12\"><a href=\"#cb2-12\" aria-hidden=\"true\"><\/a>ax <span class=\"op\">=<\/span> sns.heatmap(<\/span>\n<span id=\"cb2-13\"><a href=\"#cb2-13\" aria-hidden=\"true\"><\/a>    arr, <span class=\"co\"># values to plot<\/span><\/span>\n<span id=\"cb2-14\"><a href=\"#cb2-14\" aria-hidden=\"true\"><\/a>    cmap<span class=\"op\">=<\/span>cmap,<\/span>\n<span id=\"cb2-15\"><a href=\"#cb2-15\" aria-hidden=\"true\"><\/a>    xticklabels<span class=\"op\">=<\/span>indexes, yticklabels<span class=\"op\">=<\/span>indexes, <span class=\"co\"># label rows and cols with models<\/span><\/span>\n<span id=\"cb2-16\"><a href=\"#cb2-16\" aria-hidden=\"true\"><\/a>    cbar <span class=\"op\">=<\/span> <span class=\"va\">True<\/span>, square<span class=\"op\">=<\/span><span class=\"va\">True<\/span>, fmt <span class=\"op\">=<\/span> <span class=\"st\">'d'<\/span>,<\/span>\n<span id=\"cb2-17\"><a href=\"#cb2-17\" aria-hidden=\"true\"><\/a>    vmin<span class=\"op\">=-<\/span><span class=\"fl\">0.5<\/span>, vmax<span class=\"op\">=<\/span><span class=\"fl\">3.5<\/span><\/span>\n<span id=\"cb2-18\"><a href=\"#cb2-18\" aria-hidden=\"true\"><\/a>)<\/span>\n<span id=\"cb2-19\"><a href=\"#cb2-19\" aria-hidden=\"true\"><\/a>colorbar <span class=\"op\">=<\/span> ax.collections[<span 
class=\"dv\">0<\/span>].colorbar<\/span>\n<span id=\"cb2-20\"><a href=\"#cb2-20\" aria-hidden=\"true\"><\/a>colorbar.set_ticks([<span class=\"dv\">0<\/span>, <span class=\"dv\">1<\/span>, <span class=\"dv\">2<\/span>, <span class=\"dv\">3<\/span>,])<\/span>\n<span id=\"cb2-21\"><a href=\"#cb2-21\" aria-hidden=\"true\"><\/a>colorbar.set_ticklabels([<span class=\"st\">'p-value &gt; 0.05'<\/span>, <span class=\"st\">'p-value &lt; 0.05'<\/span>, <span class=\"st\">'p-value &lt; 0.01'<\/span>, <span class=\"st\">'p-value &lt; 0.001'<\/span>])<\/span>\n<span id=\"cb2-22\"><a href=\"#cb2-22\" aria-hidden=\"true\"><\/a>plt.show()<\/span><\/code><\/pre><\/div>\n<\/details>\n<\/div>\n\n<\/main>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"521\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?resize=625%2C521&#038;ssl=1\" alt=\"\" class=\"wp-image-12070\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?resize=1024%2C853&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?resize=300%2C250&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?resize=768%2C640&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?resize=1536%2C1280&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?resize=624%2C520&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?w=1800&amp;ssl=1 1800w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2024\/12\/heatmap-1.png?w=1250&amp;ssl=1 1250w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" 
\/><\/a><\/figure>\n\n\n\n<figure class=\"figure\">\n<figcaption>Heatmap plot of ANOVA p-values between coefficient of determination scores for three featurisation approaches. Each column and each row are a featurisation method.<\/figcaption>\n<\/figure><br>\n\n\n\n<main class=\"content\" id=\"quarto-document-content\">\n\n\n<p>For each pair of methods, there is a significant difference between the mean values for the coefficient of determination at a p-value threshold of 0.05.<\/p>\n<section id=\"final-thoughts\" class=\"level3\">\n<h3 class=\"anchored\" data-anchor-id=\"final-thoughts\">Final thoughts<\/h3>\n<p>Hopefully this exploration of covariate shift splitting, visualisation, and statistical testing is helpful! <a href=\"https:\/\/chemrxiv.org\/engage\/chemrxiv\/article-details\/672a91bd7be152b1d01a926b\">Ash <em>et al.<\/em> (2024)<\/a> go into greater detail and suggest more potential approaches for validating and communicating differences between models, so do give that a read too. Ultimately, all of these suggestions are in service of clarity; if another method for displaying your results communicates a desired point well, include it. After all, a results section is only as good as its figures.<\/p>\n<\/section>\n\n<\/main>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Author Sam Money-Kyrle Introduction An epidemic is sweeping through cheminformatics (and machine learning) research: ugly results tables. These tables are typically bloated with metrics (such as regression and classification metrics next to each other), vastly differing tasks, erratic bold text, and many models. As a consequence, results become difficult to analyse and interpret. 
Additionally, [&hellip;]<\/p>\n","protected":false},"author":118,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[633,29,361,621,632,296,189,202,227,201],"tags":[130,172,152],"ppma_author":[760],"class_list":["post-12062","post","type-post","status-publish","format-standard","hentry","category-ai","category-code","category-data-science","category-data-visualization","category-deep-learning","category-hints-and-tips","category-machine-learning","category-proteins","category-python-code","category-small-molecules","tag-cheminformatics","tag-machine-learning","tag-python"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":760,"user_id":118,"is_guest":0,"slug":"sam","display_name":"Sam 
Money-Kyrle","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/784870e2ed5304f12f11366dad56cbf1c0b9aa63bd80021ae235ba5f30536a12?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12062","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/118"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=12062"}],"version-history":[{"count":6,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12062\/revisions"}],"predecessor-version":[{"id":12144,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/12062\/revisions\/12144"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=12062"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=12062"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=12062"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=12062"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}