{"id":14021,"date":"2026-05-26T16:54:27","date_gmt":"2026-05-26T15:54:27","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=14021"},"modified":"2026-05-26T16:54:28","modified_gmt":"2026-05-26T15:54:28","slug":"how-unusual-is-your-generated-molecule-let-the-ccdc-tell-you","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2026\/05\/how-unusual-is-your-generated-molecule-let-the-ccdc-tell-you\/","title":{"rendered":"How Unusual Is Your Generated Molecule? Let The CCDC Tell You"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">In this post I&#8217;ll walk through how to set up the CCDC Python API and use the CSD Geometry Analyser to evaluate the geometric quality of molecules from three representative structure-based <em>de novo<\/em> design models. I&#8217;ve put together a small GitHub repo with the full analysis code where we look at bond lengths, angles, torsions, and ring conformations across the three methods, and compare these against their PoseBusters validity scores to see what each metric is really capturing.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\">First, <a href=\"https:\/\/www.ccdc.cam.ac.uk\/account\/login\/register\" data-type=\"link\" data-id=\"https:\/\/www.ccdc.cam.ac.uk\/account\/login\/register\">register<\/a>. I used my university email just in case.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Once you have registered, log in and activate your account. You&#8217;ll want to visit the License Portal, which you can find under My Account. Here, enter the activation key along with the customer number, both of which you can obtain from your supervisor. 
Keep a note of these; you&#8217;ll need them in the next several steps.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now, the next part is kind of weird: even though you are registered, the download is not automatically available to you, so you have to go and <em><a href=\"https:\/\/www.ccdc.cam.ac.uk\/support-and-resources\/csdsdownloads\/\">request<\/a><\/em> a download link. The download links expire within 24 hours, so be quick.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You will receive an email with a link to download <em>everything<\/em> (see image below). Here you can choose what you want to download. It is best to click the recommended &#8216;small download&#8217; for your machine. Keep in mind that even the small download takes up a good chunk of space on your machine. Once you start the download, you can tick and untick the components you&#8217;d like to download. Personally, I did not download the GUIs and only kept the Python API and datasets, but it&#8217;s up to you. In my Master&#8217;s days, I quite enjoyed using Hermes.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-1.png?ssl=1\"><img decoding=\"async\" width=\"592\" height=\"462\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-1.png?fit=592%2C462&amp;ssl=1\" alt=\"\" class=\"wp-image-14222\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-1.png?w=592&amp;ssl=1 592w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-1.png?resize=300%2C234&amp;ssl=1 300w\" sizes=\"auto, (max-width: 592px) 100vw, 592px\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This YouTube <a href=\"https:\/\/www.youtube.com\/watch?v=nXfNhO0eV_o\" data-type=\"page\" data-id=\"14277\">video<\/a> is helpful if you get confused by the download process.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Environment Setup 
<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The Python API comes with a ready-to-go environment. Depending on where you saved it during the download, the path to it will resemble something like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">[...]\/CCDC\/ccdc-software\/csd-python-api\/run_csd_python_api<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If you click <code>run_csd_python_api<\/code>, a terminal will open with the environment ready to go. Working this way is not my favourite, so I write Python scripts in Cursor or VS Code and just activate the env before running them. Be aware that you don&#8217;t get RDKit in this env. Although this is upsetting, we can just keep a split terminal open with one env in each pane.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"bash\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\n# Activate the CSD Python API conda environment\n# Don't try to use `conda activate` - it won't work\nsource ~\/CCDC\/ccdc-software\/csd-python-api\/miniconda\/bin\/activate<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Geometry Analyser Snippet<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The Geometry Analyser works by comparing molecular features against distributions observed in the Cambridge Structural Database (CSD). The CSD contains over 1.3 million small-molecule crystal structures and forms the knowledge base that Mogul, the CCDC geometry-checking program behind the Geometry Analyser, draws on when assessing whether a given bond length, angle, torsion, or ring conformation is chemically reasonable. 
You can find more on the conformer API <a href=\"https:\/\/downloads.ccdc.cam.ac.uk\/documentation\/API\/modules\/conformer_api.html\" data-type=\"link\" data-id=\"https:\/\/downloads.ccdc.cam.ac.uk\/documentation\/API\/modules\/conformer_api.html\">here<\/a> and some nice examples <a href=\"https:\/\/downloads.ccdc.cam.ac.uk\/documentation\/API\/cookbook_examples\/geometry_analyser_examples.html\" data-type=\"link\" data-id=\"https:\/\/downloads.ccdc.cam.ac.uk\/documentation\/API\/cookbook_examples\/geometry_analyser_examples.html\">here<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Okay, now for some short example code where we set up the geometry analyser and analyse a single molecule.<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">\nfrom ccdc.io import MoleculeReader\nfrom ccdc.conformer import GeometryAnalyser\n\n# simple enough, build the geometry analyser instance\ndef build_analyser():\n    analyser = GeometryAnalyser()\n    analyser.settings.generalisation = False       # MUCH faster: skips generalised CSD searches, but results are less accurate and may give higher unusual fractions\n    analyser.settings.organometallic_filter = 'Organic'  # organic mols only\n    return analyser\n\n# --- try it on a single molecule ---\nanalyser = build_analyser()\n\n# load the first molecule from an SDF file\nmol = MoleculeReader('data\/drugflow_extracted\/1a2g-A-rec-4jmv-1ly-lig-tt-min-0-pocket10_1a2g-A-rec-4jmv-1ly-lig-tt-min-0.sdf')[0]\n\n# standardise bond types and add hydrogens, as recommended before any Mogul analysis\n# this ensures the fragment matching against the CSD is as accurate as possible\nmol.assign_bond_types(which='unknown')      # assign any untyped bonds\nmol.standardise_aromatic_bonds()            # normalise aromatic bond 
representations\nmol.standardise_delocalised_bonds()         # normalise delocalised systems e.g. carboxylates\nmol.add_hydrogens()                         # Mogul needs explicit hydrogens for fragment matching\n\n# run the full geometry analysis; returns the molecule with analysis attributes attached\nanalysed = analyser.analyse_molecule(mol)\n\n# for each geometry feature type, keep only the features where Mogul found enough CSD hits\n# to make a confident judgement (default thresholds are predefined by Mogul but may be changed),\n# then compute the fraction flagged as unusual\nfor feature_name, features in [\n    ('Bond lengths',  analysed.analysed_bonds),\n    ('Bond angles',   analysed.analysed_angles),\n    ('Torsions',      analysed.analysed_torsions),\n    ('Rings',         analysed.analysed_rings),\n]:\n    # exclude features without enough CSD data: too few hits to call them unusual\n    valid = [f for f in features if f.enough_hits]\n\n    if not valid:\n        print(f\"{feature_name}: no features with enough CSD hits\")\n        continue\n\n    # fraction of valid features flagged as geometrically unusual by Mogul\n    unusual_fraction = sum(1 for f in valid if f.unusual) \/ len(valid)\n    print(f\"{feature_name}: {unusual_fraction:.2f} unusual ({len(valid)} features checked)\")<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Short Analysis<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">PoseBusters checks the structural validity of generated molecules and is an extremely useful benchmark in our field. Here I complement this with CSD Mogul geometry analysis, which adds a chemistry-aware layer by comparing bond lengths, angles, torsions and ring conformations against experimentally determined crystal structures. 
Together these two metrics paint a more complete picture of generated molecule quality.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For this short analysis I used one generated molecule per pocket across 100 pockets from the CrossDocked2020 test set, giving 100 molecules per method. The three methods, Pocket2Mol (autoregressive), TargetDiff (diffusion), and DrugFlow (flow matching), were chosen because their architectures represent a natural progression in generative modelling for structure-based drug design. What&#8217;s interesting is that the unusual fractions decrease across the architectural progression, with DrugFlow consistently showing the lowest values across all four geometry features.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-6.png?ssl=1\"><img decoding=\"async\" width=\"989\" height=\"590\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-6.png?fit=989%2C590&amp;ssl=1\" alt=\"\" class=\"wp-image-14309\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-6.png?w=989&amp;ssl=1 989w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-6.png?resize=300%2C179&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-6.png?resize=768%2C458&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-6.png?resize=624%2C372&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now, let&#8217;s see if these results are reflected in PoseBusters pass rates:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><td><strong>Method<\/strong><\/td><td><strong>PoseBusters Pass Rate<\/strong> 
(%)<\/td><\/tr><tr><td>DrugFlow<\/td><td>90%<\/td><\/tr><tr><td>Pocket2Mol<\/td><td>100%<\/td><\/tr><tr><td>TargetDiff<\/td><td>45%<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The rankings do not line up: Pocket2Mol has the highest PoseBusters pass rate, even though DrugFlow has the lowest unusual fractions. We won&#8217;t draw any hard conclusions since this is only a fun blog post. It is worth noting that we are comparing two different things. PoseBusters applies hard pass\/fail thresholds to individual bond lengths and angles, whereas the CCDC Geometry Analyser takes a knowledge-based approach, flagging features that fall outside distributions observed in the CSD, which makes it a soft signal rather than a strict failure criterion. I think it would be interesting to look at how the minimised poses from each method would fare, alongside some minimised and raw crystal structure poses!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I provide the scripts and the data used to generate the plot above in a GitHub repo: <a href=\"https:\/\/github.com\/SanazKaz\/molgeom-eval\">https:\/\/github.com\/SanazKaz\/molgeom-eval<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The data for all models was obtained via the DrugFlow GitHub repo, which links to: <a href=\"https:\/\/zenodo.org\/records\/14919171\">https:\/\/zenodo.org\/records\/14919171<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Happy CCDC&#8217;ing!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post I&#8217;ll walk through how to set up the CCDC Python API and use the CSD Geometry Analyser to evaluate the geometric quality of molecules from three representative structure-based de novo design models. 
I&#8217;ve put together a small GitHub repo with the full analysis code where we look at bond lengths, angles, torsions, [&hellip;]<\/p>\n","protected":false},"author":128,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[633,341,228,227,265],"tags":[943,944,945,858],"ppma_author":[802],"class_list":["post-14021","post","type-post","status-publish","format-standard","hentry","category-ai","category-databases","category-protein-structure","category-python-code","category-x-ray-crystallography","tag-ccdc","tag-csd","tag-mogul","tag-posebusters"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":802,"user_id":128,"is_guest":0,"slug":"sanaz","display_name":"Sanaz 
Kazeminia","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/d7ee2fbf2cb52aaa1856ad4e395733a6a561811dad16c2ae3b60b3b8d5f6c68c?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14021","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/128"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=14021"}],"version-history":[{"count":4,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14021\/revisions"}],"predecessor-version":[{"id":14321,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14021\/revisions\/14321"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=14021"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=14021"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=14021"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=14021"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}