In this post I’ll walk through how to set up the CCDC Python API and use the CSD Geometry Analyser to evaluate the geometric quality of molecules from three representative structure-based de novo design models. I’ve put together a small GitHub repo with the full analysis code where we look at bond lengths, angles, torsions, and ring conformations across the three methods, and compare these against their PoseBusters validity scores to see what each metric is really capturing.
Category Archives: Code
Speeding up python through profiling
Python is a shockingly slow language. A test on a Raspberry Pi of simply “turn this pin on and off as fast as you can” gave the results below.
| System | Library | Speed |
|---|---|---|
| Shell | /proc/mem access | 2.8 kHz |
| Shell / gpio utility | WiringPi gpio utility | 40 Hz |
| Python | RPi.GPIO | 70 kHz |
| Python | wiringPi2 bindings | 28 kHz |
| Ruby | wiringPi bindings | 21 kHz |
| C | Native library | 22 MHz |
| C | BCM 2835 | 5.4 MHz |
| C | wiringPi | 4.1 – 4.6 MHz |
| Perl | BCM 2835 | 48 kHz |
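Before optimising a loop like this, it helps to see where the time actually goes. The snippet below is a minimal sketch of profiling with the standard-library `cProfile` and `pstats` modules. The functions `toggle_pin` and `toggle_loop` are hypothetical stand-ins that flip a boolean rather than drive real GPIO hardware, so the numbers it reports say nothing about the benchmark in the table.

```python
import cProfile
import io
import pstats


def toggle_pin(state):
    """Hypothetical stand-in for a GPIO write: flips a boolean, no hardware."""
    return not state


def toggle_loop(n_toggles):
    """Toggle the fake pin n_toggles times and return the final state."""
    state = False
    for _ in range(n_toggles):
        state = toggle_pin(state)
    return state


def profile_toggle_loop(n_toggles=100_000):
    """Run toggle_loop under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    toggle_loop(n_toggles)
    profiler.disable()

    # Send the report to a string buffer instead of stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_toggle_loop())
```

The report lists call counts and cumulative time per function, which is usually enough to tell whether the cost sits in the loop itself or in the per-call overhead.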
SigmaDock: untwisting molecular docking with fragment-based SE(3) diffusion
Molecular docking sits at the heart of structure-based drug discovery. If we can reliably predict how a small molecule binds in a protein pocket, we can prioritize compounds faster, reason about interactions more clearly, and build better pipelines for hit discovery and lead optimization. But in practice, docking is still a difficult problem: classical methods are often robust but imperfect, while recent deep learning approaches have sometimes looked promising on headline metrics without consistently producing chemically plausible poses.
SigmaDock was built to address exactly that gap. Instead of treating docking as a problem of directly diffusing on torsion angles or unconstrained atomic coordinates, SigmaDock represents ligands as collections of rigid fragments and learns how to reassemble them inside the binding pocket using diffusion on SE(3). In plain English: rather than trying to “wiggle” every flexible degree of freedom in a tangled way, SigmaDock breaks the ligand into chemically meaningful rigid pieces and learns where those pieces should go, and how they should reorient, to recover a valid bound pose.
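To make “rigid fragment” concrete, here is a small illustrative sketch (not SigmaDock's code) of what a rigid-body move does to a fragment: every atom gets the same rotation and the same translation, so distances inside the fragment are unchanged. For brevity the rotation is restricted to the z-axis, whereas a general SE(3) element allows any 3D rotation; the coordinates and function names are made up for the example.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by angle (radians)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def rigid_move(fragment, angle, translation):
    """Apply one rotation (about z) and one translation to every atom."""
    tx, ty, tz = translation
    moved = []
    for atom in fragment:
        x, y, z = rotate_z(atom, angle)
        moved.append((x + tx, y + ty, z + tz))
    return moved


def pairwise_distances(fragment):
    """All interatomic distances within a fragment, in a fixed order."""
    return [
        math.dist(a, b)
        for i, a in enumerate(fragment)
        for b in fragment[i + 1:]
    ]


if __name__ == "__main__":
    # Toy three-atom fragment (arbitrary coordinates in angstroms).
    fragment = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.3)]
    moved = rigid_move(fragment, math.pi / 3, (5.0, -2.0, 1.0))
    print(pairwise_distances(fragment))
    print(pairwise_distances(moved))  # same values up to rounding
```

Because internal geometry cannot change under such a move, bond lengths and angles within each fragment stay valid by construction; only the relative placement of fragments has to be learned.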

Nice TCR processing libraries
As someone who works with T cell antigen receptor (TCR) and peptide-major histocompatibility complex (pMHC) data, I have found several Python packages to be very useful for eliminating tedious steps in data cleaning and feature engineering stages.
Building a “Second Brain” – A Functional Knowledge Stack with Obsidian

Whilst I always enjoy the acquisition of knowledge, I’ve always struggled with depositing it usefully. From pen and paper notes with a 20 colour theme which lost value with each additional colour, to OneNote or iPad GoodNotes based emulations of pen and paper, it’s been a constant quest for the optimal note taking schema. Personally there are 3 key objectives I need my note taking to achieve:
- It must be digitally compatible and accessible from any device.
- It must comfortably handle math and images.
- It must be something I look forward to – the software needs to be aesthetically clean, lightweight with none of the chunkiness of Microsoft apps, and highly customisable.
For me the solution to this was Obsidian, the perhaps more cultified sibling to Notion. Obsidian is a note taking application that uses markdown with a surprising amount of flexibility, including the ability to partner it with an LLM, which I’ll explore in this blog alongside my vault organisation do or dies and favourite customisations.
Advanced PyMOL Visualization for Weighted Structural Ensembles (Part 2): Efficient Weighted SASA Surfaces
In Part 1, we covered reference state handling, RMSD-based coloring, and cluster visualization for weighted structural ensembles. Now we tackle a more ambitious goal: generating solvent-accessible surface area (SASA) surfaces that reflect the weighted conformational distribution of your ensemble.
Why surfaces? Because they show the accessible conformational space—where your protein can actually be found, weighted by population. This is particularly powerful when comparing different fitting methods or showing how experimental constraints reshape the ensemble.
The challenge? A typical ensemble might have 500+ frames, each generating thousands of surface points. Naive approaches choke on the computational and memory demands. This post shares the optimizations that make weighted SASA visualization practical.
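As a rough illustration of what “weighted” means here, the sketch below computes a population-weighted mean SASA per residue after discarding frames whose weight is negligible. This is an assumption about one plausible way to cut the cost, not necessarily the optimization used in the post; the function names, the `min_weight` threshold, and the input layout are all hypothetical.

```python
def prune_and_renormalise(weights, min_weight=1e-4):
    """Drop frames below min_weight and rescale the rest to sum to 1.

    Returns a list of (frame_index, normalised_weight) pairs.
    """
    kept = [(i, w) for i, w in enumerate(weights) if w >= min_weight]
    total = sum(w for _, w in kept)
    if total <= 0:
        raise ValueError("no frames left after pruning")
    return [(i, w / total) for i, w in kept]


def weighted_mean_sasa(per_frame_sasa, frame_weights):
    """Population-weighted mean SASA per residue.

    per_frame_sasa[i][r] is the SASA of residue r in frame i;
    frame_weights is the output of prune_and_renormalise.
    """
    n_residues = len(per_frame_sasa[0])
    means = [0.0] * n_residues
    for frame_index, weight in frame_weights:
        for r in range(n_residues):
            means[r] += weight * per_frame_sasa[frame_index][r]
    return means


if __name__ == "__main__":
    weights = [0.5, 0.5, 0.00001]          # third frame is negligible
    sasa = [[10.0, 0.0], [20.0, 0.0], [1000.0, 1000.0]]
    kept = prune_and_renormalise(weights)
    print(weighted_mean_sasa(sasa, kept))
```

Pruning trades a small bias for a large reduction in the number of frames that need a surface calculation, so the threshold should be checked against the total weight being discarded.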
Democratising the Dark Arts: Writing Triton Kernels with Claude
Why would you ever want to leave the warm, fuzzy embrace of torch.nn? It works, it’s differentiable, and it rarely causes your entire Python session to segfault without a stack trace. The answer usually comes down to the “Memory Wall.” Modern deep learning is often less bound by how fast your GPU can do math (FLOPS) and more bound by how fast it can move data around (Memory Bandwidth). When you write a sequence of simple PyTorch operations, something like x = x * 2 + y, the GPU often reads x from memory, multiplies it, writes it back, reads it again to add y, and writes it back again. It’s the computational equivalent of making five separate trips to the grocery store because you forgot the eggs, then the milk, then the bread. Writing a custom kernel lets you “fuse” these operations. You load the data once, perform a dozen mathematical operations on it while it sits in the ultra-fast chip registers, and write it back once. The performance gains can be massive (often 2x-10x for specific layers). But traditionally, the “cost” of accessing those gains (learning C++, understanding warp divergence, and managing memory manually) was just too high for most researchers. That equation is finally changing.
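The memory-traffic argument can be made concrete with a toy model. The sketch below is plain Python, not a Triton kernel: it contrasts a two-pass version of x * 2 + y with a single-pass “fused” version and counts element transfers under the simplifying assumption that every intermediate array goes through main memory. All function names are illustrative.

```python
def unfused_scale_add(x, y):
    """Two passes: the intermediate tmp is fully written, then re-read."""
    tmp = [xi * 2 for xi in x]                   # read x, write tmp
    return [ti + yi for ti, yi in zip(tmp, y)]   # read tmp and y, write out


def fused_scale_add(x, y):
    """One pass: each element is scaled and added before anything is stored."""
    return [xi * 2 + yi for xi, yi in zip(x, y)]  # read x and y, write out


def memory_transfers(n_elements, fused):
    """Element reads plus writes under the toy model described above."""
    if fused:
        return 3 * n_elements   # read x, read y, write out
    return 5 * n_elements       # read x, write tmp, read tmp, read y, write out


if __name__ == "__main__":
    x, y = [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]
    assert unfused_scale_add(x, y) == fused_scale_add(x, y)
    n = 1_000_000
    print(memory_transfers(n, fused=False), memory_transfers(n, fused=True))
```

In this model fusing two operations cuts traffic from 5N to 3N element transfers; the saving grows as more elementwise steps are folded into the same pass, since each extra unfused step adds a write and a read of a full intermediate.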
Finding 250GB of Missing Storage On My Mac: A Warning For Large Dataset Users
I recently faced a puzzling issue: my 1TB MacBook Pro showed only 150GB free, but disk analyzers could only account for about 500GB of used space. After hours of troubleshooting, I discovered that Spotlight’s search index had ballooned to 233GB, hundreds of times larger than normal.
The Problem
Standard disk analyzers showed that my Mac had 330GB of “Inaccessible Disk Space” and 66GB of “Purgeable Disk Space”, but gave no clear explanation for where my storage had gone. Removing the purgeable space was easy enough with sudo purge, but none of the fixes ChatGPT recommended had any effect on the inaccessible disk space: clearing Time Machine snapshots, clearing the pip and conda caches with pip cache purge and conda clean --all, and restarting the computer.
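When a disk analyzer reports space it cannot attribute, one option is to total directory sizes yourself. The following is a minimal sketch using only the standard library; the function names are my own, and it sums apparent file sizes (`st_size`), which can differ from the blocks actually allocated on disk. Directories the current user cannot read are skipped silently, so protected locations such as a volume's Spotlight index would likely need the script to be run with elevated privileges.

```python
import os


def directory_size_bytes(root):
    """Sum apparent file sizes under root, skipping unreadable entries."""
    total = 0
    # onerror swallows permission errors raised while listing directories.
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size  # lstat: do not follow symlinks
            except OSError:
                continue
    return total


def largest_subdirectories(root, top_n=10):
    """Return up to top_n (size_in_bytes, path) pairs, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((directory_size_bytes(entry.path), entry.path))
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    for size, path in largest_subdirectories(os.path.expanduser("~")):
        print(f"{size / 1e9:8.2f} GB  {path}")
```

Comparing the total from a scan like this with the used space the system reports gives a rough size for whatever is hidden from user-level tools.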
Using Node-RED as a front-end to your software
Node-RED is an open-source visual programming tool that lets you wire together hardware (such as sensors), APIs (such as REST/POST) and custom functions. However, its custom functions aren’t limited to the JavaScript you write; they can also be containers!
This can provide an intuitive front-end to otherwise difficult software. For example, you’ve written your magnum opus, you’ve even documented it (though no-one will ever read it) and to ensure maximum compatibility for the widest possible audience, you’ve containerised it. But it’s still a command-line driven application. Using Node-RED you can make this accessible to an inexperienced audience.

Out of the box, Node-RED is quite pretty, and you can string together nodes to perform useful functions. In this case, the flow monitors a log file: if the log doesn’t grow, something’s gone wrong, so it emails me to take a look.
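For readers who think in code rather than flow diagrams, the logic of that monitoring flow can be sketched in a few lines of Python. This is not how Node-RED implements it, only the same idea written out: compare the log's size with the size seen at the previous check and raise an alert if it has not grown. The `alert` callback is a hypothetical placeholder for whatever sends the email.

```python
import os


def check_log(path, previous_size, alert):
    """Call alert(message) if the log has not grown; return the current size."""
    current_size = os.path.getsize(path)
    if current_size <= previous_size:
        alert(f"{path} has not grown (still {current_size} bytes)")
    return current_size


if __name__ == "__main__":
    import time

    log_path = "app.log"          # hypothetical log file
    size = os.path.getsize(log_path)
    while True:
        time.sleep(600)           # poll every ten minutes
        size = check_log(log_path, size, alert=print)
```

Note that a rotated or truncated log also looks like “no growth” to this check, which may or may not be the behaviour you want.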
Extracting 3D Pharmacophore Points with RDKit
Pharmacophores are simplified representations of the key interactions ligands make with proteins, such as hydrogen bonds, charge interactions, and aromatic contacts. Think of them as the essential “bumps and grooves” on a key that allow it to fit its lock (the protein). These maps can be derived from ligands or protein–ligand complexes and are powerful tools for virtual screening and generative models. Here, we’ll see how to extract 3D pharmacophore points from a ligand using RDKit.
(Code adapted from Dr. Ruben Sanchez.)
Why pharmacophore “points”?
RDKit represents each pharmacophore feature (donor, acceptor, aromatic, etc.) as a point in 3D space, located at the feature center. These points capture the essential interaction motifs of a ligand without requiring the full atomic detail.
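A minimal sketch of the idea, assuming RDKit is installed and using its bundled BaseFeatures.fdef feature definitions: embed a 3D conformer, build a feature factory, and read off each feature's family and position. The function name `pharmacophore_points` and the choice of a SMILES input are mine for illustration; the post's own code may start from a ligand file or a protein–ligand complex instead.

```python
import os

from rdkit import Chem, RDConfig
from rdkit.Chem import AllChem, ChemicalFeatures


def pharmacophore_points(smiles, seed=42):
    """Return (family, (x, y, z), atom_ids) for each feature of a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)

    # Feature positions need 3D coordinates; EmbedMolecule returns 0 on success.
    if AllChem.EmbedMolecule(mol, randomSeed=seed) != 0:
        raise ValueError("3D embedding failed")

    # BaseFeatures.fdef ships with RDKit and defines Donor, Acceptor, etc.
    fdef_path = os.path.join(RDConfig.RDDataDir, "BaseFeatures.fdef")
    factory = ChemicalFeatures.BuildFeatureFactory(fdef_path)

    points = []
    for feature in factory.GetFeaturesForMol(mol):
        pos = feature.GetPos()
        points.append(
            (feature.GetFamily(), (pos.x, pos.y, pos.z), feature.GetAtomIds())
        )
    return points


if __name__ == "__main__":
    for family, xyz, atom_ids in pharmacophore_points("c1ccccc1O"):  # phenol
        print(family, [round(c, 2) for c in xyz], atom_ids)
```

For multi-atom features such as aromatic rings, the reported position is the feature center rather than any single atom, which is what makes the point representation compact.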