{"id":12538,"date":"2025-04-29T09:26:08","date_gmt":"2025-04-29T08:26:08","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=12538"},"modified":"2025-05-29T00:31:41","modified_gmt":"2025-05-28T23:31:41","slug":"featurisation-is-key-one-version-change-that-halved-diffdocks-performance","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/04\/featurisation-is-key-one-version-change-that-halved-diffdocks-performance\/","title":{"rendered":"Featurisation is Key: One Version Change that Halved DiffDock&#8217;s Performance"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\"><strong>1. Introduction&nbsp;<\/strong><\/h2>\n\n\n\n<p>Molecular docking with graph neural networks works by representing the molecules as featurized graphs. In DiffDock, each ligand becomes a graph of atoms (nodes) and bonds (edges), with features assigned to&nbsp;every atom using chemical properties such as atom type, implicit valence and formal charge.&nbsp;<br>&nbsp;<br>We recently discovered that a change in RDKit versions significantly reduces performance on the PoseBusters benchmark, due to changes in the \u201cimplicit valence\u201d feauture. This post walks through:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>How DiffDock featurises ligands&nbsp;<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>What happened when we upgraded RDKit 2022.03.3 \u2192 2025.03.1&nbsp;<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Why training with zero-only features and testing on non-zero features is so bad&nbsp;<\/li>\n<\/ul>\n\n\n\n<p><strong><em>TL:DR: <\/em><\/strong><strong>Use the dependencies listed in the environment.yml file, especially in the case of DiffDock, or your performance could half! <\/strong>&nbsp;<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\"><strong>2. Graph Representation in DiffDock&nbsp;<\/strong><\/h2>\n\n\n\n<p>DiffDock turns a ligand into input for a graph neural net by&nbsp;<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>Loading the ligand from an SDF file via RDKit.&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol start=\"2\" class=\"wp-block-list\">\n<li>Stripping all hydrogens to keep heavy atoms only.&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol start=\"3\" class=\"wp-block-list\">\n<li>Featurising each atom into a 16-dimensional vector:&nbsp;<\/li>\n<\/ol>\n\n\n\n<p>0: Atomic number&nbsp;<\/p>\n\n\n\n<p>1: Chirality tag&nbsp;<\/p>\n\n\n\n<p>2: Total bond degree&nbsp;<\/p>\n\n\n\n<p>3: Formal charge&nbsp;<\/p>\n\n\n\n<p>4: Implicit valence&nbsp;<\/p>\n\n\n\n<p>5: Number of implicit H\u2019s&nbsp;<\/p>\n\n\n\n<p>6: Radical electrons&nbsp;<\/p>\n\n\n\n<p>7: Hybridisation&nbsp;<\/p>\n\n\n\n<p>8: Aromatic flag&nbsp;<\/p>\n\n\n\n<p>9-15: Ring-membership flags (rings of size 3\u20138)&nbsp;<\/p>\n\n\n\n<ol start=\"4\" class=\"wp-block-list\">\n<li>Building a PyG HeteroData containing node features and bond-edges.&nbsp;<\/li>\n<\/ol>\n\n\n\n<ol start=\"5\" class=\"wp-block-list\">\n<li>Randomizing position, orientation and torsion angles before inputting to the model for inference.&nbsp;<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>3. 
## 3. PoseBusters Benchmark & RDKit Version Bump

Using the supplied evaluation.py script, which docks into whichever protein chains the ground truth is bound to, we evaluated on the 428-complex PoseBusters set using two different RDKit versions:

| RDKit version | <2 Å RMSD success rate |
|---|---|
| 2022.03.3 | 50.89 % |
| 2025.03.1 | 23.72 % |

With no changes other than the RDKit version, the success rate dropped by more than half.

Having checked the evaluation and conformer-generation steps, I took a more detailed look at the preprocessed data being fed into the model under each RDKit version. Everything was identical except implicit valence:

- RDKit 2022.03.3: implicit valence = 0 for every atom
- RDKit 2025.03.1: implicit valence ranges from 0 to 3

#### Relevant Changes to RDKit's GetImplicitValence()

Between 2022.03.3 and 2025.03.1, RDKit was refactored so that implicit hydrogen counts are recomputed rather than permanently zeroed out after stripping explicit H's.

Old 2022.03.3 behaviour:

- RemoveHs() deletes all explicit hydrogens and sets each heavy atom's internal flag df_noImplicit = true, keeping only a heavy-atom representation.
- Once df_noImplicit is set, asking for the implicit valence always returns 0, even if you re-run sanitisation.

New 2025.03.1 behaviour:

- RemoveHs() deletes explicit hydrogens but does not set df_noImplicit = true, allowing recomputation of the implicit valence.
- Sanitisation calculates implicit valence = allowed valence − sum of explicit bonds.
- GetImplicitValence() returns the correct implicit valence, even after stripping all H's.

These changes mean:

- *Old* (2022.03.3): RemoveHs() → df_noImplicit set → GetImplicitValence() always 0
- *New* (2025.03.1): RemoveHs() (flag untouched) → sanitisation recomputes → GetImplicitValence() returns the correct implicit-H count

Because DiffDock was only ever trained on zeros at that index, suddenly inputting non-zero values at inference caused this collapse in performance.

We force-zeroed that column and recovered performance under the new RDKit, confirming that this feature caused the drop:

```diff
- implicit_valence = atom.GetImplicitValence()
+ implicit_valence = 0
```

| RDKit build | Success rate |
|---|---|
| 2022.03.3 baseline | 50.89 % |
| 2025.03.1 unpatched | 23.72 % |
| 2025.03.1 patched | 50.26 % |
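The version difference is easy to reproduce in isolation. The snippet below is a sketch using real RDKit calls on a toy molecule (ethanol); the printed values are what you would expect under each version given the behaviour described above, so treat them as illustrative rather than guaranteed output.

```python
from rdkit import Chem

# Ethanol: both carbons and the oxygen carry implicit hydrogens
# once explicit H's are stripped.
mol = Chem.MolFromSmiles("CCO")
mol = Chem.AddHs(mol)      # make the hydrogens explicit...
mol = Chem.RemoveHs(mol)   # ...then strip them, as DiffDock's preprocessing does

for atom in mol.GetAtoms():
    print(atom.GetSymbol(), atom.GetImplicitValence())

# Under RDKit 2025.03.1 (implicit valence recomputed after RemoveHs):
#   C 3
#   C 2
#   O 1
# Under RDKit 2022.03.3 (df_noImplicit set by RemoveHs):
#   C 0
#   C 0
#   O 0
```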
## 4. Why Zero-Trained → Non-Zero-Tested Is So Bad

Consider a single weight *w* in the first layer of the network: it controls how much "implicit valence" influences the network. There is also a built-in bias *b* and an activation function ϕ. Together they compute:

    output = ϕ(w v + b)

where *v* is the implicit valence feature.

#### What Happens When You Train on Only Zeros?

- Implicit valence (*v*) = 0 every time you train.
- Since the input is always zero, there is no signal telling *w* to move. In the absence of an explicit mechanism pushing the weights towards zero, such as weight decay, they remain non-zero.
- Effectively, the model learns that the implicit valence column doesn't matter, and *w* stays at its random starting point.

#### What Happens at Test Time?

- The implicit valence feature (*v*) might now be 1, 2, or 3.
- The unchanged, random *w* multiplies this new *v*, producing unpredictable activations ϕ(w_random v + b).
- These activations propagate through the downstream layers all the way to the final prediction. (A minimal numerical sketch of this failure mode appears as a postscript below.)

## 5. Conclusion

Featurisation is very important: in the case of DiffDock, one library tweak changed one feature column and halved the performance! The fix was easy once it was found, but remember:

1. Featurisation is key.
2. Particularly in the case of DiffDock, use the listed dependency versions!
3. If you see a sudden large change in performance, it is worth checking your package versions and your features...

Happy docking!
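*Postscript:* a minimal numerical sketch of the Section 4 failure mode, using a toy one-unit model in PyTorch (with ϕ taken as the identity so the gradients are easy to read). Everything here is illustrative; it is not DiffDock's architecture.

```python
import torch

torch.manual_seed(0)

# Toy version of the unit from Section 4: output = phi(w*v + b), phi = identity.
w = torch.randn(1, requires_grad=True)   # random starting point
b = torch.zeros(1, requires_grad=True)

# Training: the implicit-valence feature v is always 0, so w receives
# zero gradient regardless of the target, and never moves.
v_train = torch.zeros(100)
target = torch.randn(100)
loss = ((w * v_train + b - target) ** 2).mean()
loss.backward()
print("grad on w:", w.grad)   # tensor([0.]) -> w stays random
print("grad on b:", b.grad)   # non-zero    -> the bias still trains

# Test time: v is suddenly 1, 2 or 3, and the untrained random w
# injects arbitrary values into everything downstream.
v_test = torch.tensor([1.0, 2.0, 3.0])
print("test activations:", (w * v_test + b).detach())
```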