{"id":12920,"date":"2025-08-27T14:18:36","date_gmt":"2025-08-27T13:18:36","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=12920"},"modified":"2025-08-28T17:27:37","modified_gmt":"2025-08-28T16:27:37","slug":"how-reliable-are-affinity-datasets-in-practice","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/08\/how-reliable-are-affinity-datasets-in-practice\/","title":{"rendered":"How reliable are affinity datasets in practice?"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">The Data Bottleneck in AI-Powered Drug Discovery<\/h2>\n\n\n\n<p>The pharmaceutical industry is undergoing a profound transformation, driven by the promise of Artificial Intelligence (AI) and Machine Learning (ML). These technologies offer the potential to escape the industry&#8217;s persistent challenges of high costs, protracted development timelines, and staggering failure rates. From accelerating the identification of novel biological targets to optimizing the properties of lead compounds, AI is poised to enhance the precision and efficiency of drug discovery at nearly every stage<\/p>\n\n\n\n<p>Yet, this revolutionary potential is constrained by a fundamental dependency. The power of modern AI, particularly the deep learning (DL) models that excel at complex pattern recognition, is directly proportional to the volume, diversity, and quality of the data they are trained on. This creates a critical bottleneck: the high-quality experimental data required to train these models\u2014specifically, the protein-ligand binding affinity values that quantify the strength of an interaction\u2014are notoriously scarce, expensive to generate, and often of inconsistent quality or locked within proprietary databases.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Quantifying the Noise Floor<\/h2>\n\n\n\n<p>A foundational 2013 study by Kramer <em>et al.<\/em> provided a rigorous look at the statistical cost of this common practice. By meticulously filtering the ChEMBL database, they found that over 90% of the raw IC<sub>50<\/sub> data was unsuitable for a direct, head-to-head comparison. The main reasons for this were:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Inconsistent &amp; unit conversion errors (\u00b5M, M, mM)<\/li>\n\n\n\n<li>Un-bounded values (&lt;, &gt;, ~)<\/li>\n\n\n\n<li>Removal of singletons (non-repeated measures)<\/li>\n\n\n\n<li>Measures were from the same lab (author overlap_<\/li>\n<\/ul>\n\n\n\n<p>From the remaining 10% of data (~660K), they further filtered the data according to the ChEMBL confidence score (0-9), filtering scores below 4. A score of four or more indicates a biochemical measurement and a confidence score below four indicates a cellular measurement. 
![](https://i0.wp.com/www.blopig.com/blog/wp-content/uploads/2025/08/Screenshot-2025-08-17-at-13.28.46-5.png)

Perhaps the most critical part of their analysis was investigating *why* some measurements were so wildly different. They manually inspected pairs of data with large discrepancies (ΔpIC50 > 2.5, i.e. a more than 300-fold difference) and discovered that these were almost always caused by simple **annotation errors**.

The most common mistakes they found were:

- **Unit errors:** a value measured in micromolar (µM) was incorrectly recorded as nanomolar (nM), creating an instant 1000-fold error. This was the most frequent issue.
- **Receptor subtype errors:** a compound's activity against a general receptor family (*e.g.*, "dopamine receptor") was mixed with data for a specific subtype (*e.g.*, "D2 receptor").
- **Assay type errors:** data from a cellular assay was mislabelled as a biochemical assay.
- **Other errors:** mistakes in stereochemistry, incorrect target assignments, or simply extracting the wrong value from the original publication.

These findings show that the most extreme outliers in public databases are often not due to subtle differences in experimental conditions, but to straightforward human or data-entry errors. This is why the correlation across all data pairs was low (R² = 0.40) yet improved markedly (R² = 0.53) simply by removing these error-prone pairs.
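A unit slip alone explains the worst offenders: a 1 µM IC50 logged as 1 nM moves the pIC50 from 6 to 9, three full log units. Here is a hedged sketch of how one might flag such pairs and measure their effect on agreement (the paired arrays are hypothetical inputs, not the authors' data):

```python
import numpy as np

def r_squared(a: np.ndarray, b: np.ndarray) -> float:
    """Squared Pearson correlation between paired measurements."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)

def agreement_with_and_without_outliers(pic50_a, pic50_b, cutoff=2.5):
    """Compare R2 before and after dropping likely annotation errors.

    pic50_a / pic50_b hold paired, independent measurements of the same
    protein-ligand systems (hypothetical inputs).
    """
    a = np.asarray(pic50_a, dtype=float)
    b = np.asarray(pic50_b, dtype=float)
    # A gap above 2.5 log units (~300-fold) flags a likely annotation error.
    keep = np.abs(a - b) <= cutoff
    return {
        "r2_all": r_squared(a, b),
        "r2_cleaned": r_squared(a[keep], b[keep]),
        "n_flagged": int((~keep).sum()),
    }
```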
![](https://i0.wp.com/www.blopig.com/blog/wp-content/uploads/2025/08/Screenshot-2025-08-17-at-13.28.30-1.png)

However, even the cleaned value is strikingly low compared with the performance state-of-the-art affinity models claim: R² values of up to 0.86 in some cases.

## Mixing IC50 and Ki is fine, just add a bias

Finally, the study offered a practical piece of advice. They found that Ki data (a measure of binding affinity) is about 25% less noisy than IC50 data. However, since far more IC50 data is available, it is often desirable to combine the two. They showed that you can add Ki data to an IC50 dataset without making it worse, as long as you apply a simple correction: subtract 0.35 log units from the pKi values first.

![](https://i0.wp.com/www.blopig.com/blog/wp-content/uploads/2025/08/Screenshot-2025-08-17-at-13.28.21-1.png)
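The correction itself is a one-liner. A minimal sketch, assuming the pKi and pIC50 values are already on the same -log10(M) scale (the DataFrame layout and column names are assumptions for illustration):

```python
import pandas as pd

# Constant bias between pKi and pIC50 reported by Kramer et al. (2013).
KI_TO_IC50_OFFSET = 0.35  # log units

def merge_affinity_labels(ic50: pd.DataFrame, ki: pd.DataFrame) -> pd.DataFrame:
    """Fold pKi rows into a pIC50 training set with a constant correction."""
    ki = ki.assign(pic50=ki["pki"] - KI_TO_IC50_OFFSET, source="Ki_corrected")
    ic50 = ic50.assign(source="IC50")
    cols = ["target_id", "molecule_id", "pic50", "source"]
    return pd.concat([ic50[cols], ki[cols]], ignore_index=True)
```

Keeping a `source` column is a cheap way to check later whether the corrected Ki rows behave differently in training.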
## Closing Remarks

So the question to you, reader, is: **have you ever checked, or even considered, how noisy the affinity data you train your ML models on actually is?**

The reality is that a lot of labelled data follows "whatever-I-want" protocols, and these differ from lab to lab. The truth is, using empirical datasets is a bit like ordering french fries at McDonald's. Sometimes they are perfectly crispy and salty, and sometimes they are soggy and tasteless.

We haven't even touched on IC50 curve fitting, which is another massive source of inconsistency, so imagine how many things can go wrong. If you told a medicinal chemist that a set of hits was not triplicate-confirmed, they would not trust the readings. So why should we trust IC50/Ki/Kd values taken at face value? Can we please start publishing affinity labels together with their underlying IC50 curves?

**TL;DR**: **Be careful about trusting your data**, and always verify how measurements from different labs were collected and mixed. You could well be sabotaging your ML project!