Cheating at Spelling Bee using the command line


The New York Times Spelling Bee is a free online word game, where players must construct as many words as possible using the letters provided in the Bee’s grid, always including the middle letter. Bonus points for using all the letters and creating a pangram.


However, this is the kind of thing which computers are very good at. If you’ve become frustrated trying to decipher the abstruse ways of the bee, let the command line help.

tl;dr:

grep -iP "^[adlokec]{6,}$" /usr/share/dict/words | grep a | awk '{ print length, $0 }' | sort -n | cut -f2 -d" "
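For clarity, here is a rough Python sketch of the same filtering logic (the letter set, middle letter, dictionary path and minimum length are all taken from the example above; adjust them for the day's puzzle):

# Rough Python equivalent of the pipeline above.
letters = set("adlokec")   # the seven letters in the grid
middle = "a"               # the compulsory middle letter
min_len = 6                # mirrors the {6,} in the regex

with open("/usr/share/dict/words") as fh:
    words = [w.strip().lower() for w in fh]

hits = [w for w in words
        if len(w) >= min_len and middle in w and set(w) <= letters]

for w in sorted(hits, key=len):   # shortest first, like sort -n
    print(w)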
Continue reading

Featurisation is Key: One Version Change that Halved DiffDock’s Performance

1. Introduction 

Molecular docking with graph neural networks works by representing the molecules as featurised graphs. In DiffDock, each ligand becomes a graph of atoms (nodes) and bonds (edges), with features assigned to every atom using chemical properties such as atom type, implicit valence and formal charge. 
 
We recently discovered that a change in RDKit version significantly reduces performance on the PoseBusters benchmark, due to changes in the “implicit valence” feature. This post walks through: 

  • How DiffDock featurises ligands 
  • What happened when we upgraded RDKit 2022.03.3 → 2025.03.1 
  • Why training with zero-only features and testing on non-zero features is so bad 

TL;DR: Use the dependencies listed in the environment.yml file, especially in the case of DiffDock, or your performance could halve!  

2. Graph Representation in DiffDock 

DiffDock turns a ligand into input for a graph neural network by: 

  1. Loading the ligand from an SDF file via RDKit. 
  2. Stripping all hydrogens to keep heavy atoms only. 
  3. Featurising each atom into a 16-dimensional vector: 

     0: Atomic number 
     1: Chirality tag 
     2: Total bond degree 
     3: Formal charge 
     4: Implicit valence 
     5: Number of implicit H’s 
     6: Radical electrons 
     7: Hybridisation 
     8: Aromatic flag 
     9–15: Ring-membership flags (rings of size 3–8) 

  4. Building a PyG HeteroData object containing node features and bond edges. 
  5. Randomising position, orientation and torsion angles before inputting to the model for inference. 
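As a rough sketch, steps 1–3 might look like the following with RDKit (illustrative only: the real DiffDock featuriser maps each raw value through allowed-value lookup tables before embedding, and the file path is a placeholder):

# Illustrative sketch of steps 1-3; treat as a simplification of DiffDock's featuriser.
from rdkit import Chem

def atom_features(atom):
    ring = atom.GetOwningMol().GetRingInfo()
    return [
        atom.GetAtomicNum(),                   # 0: atomic number
        int(atom.GetChiralTag()),              # 1: chirality tag
        atom.GetTotalDegree(),                 # 2: total bond degree
        atom.GetFormalCharge(),                # 3: formal charge
        atom.GetImplicitValence(),             # 4: implicit valence (the culprit)
        atom.GetNumImplicitHs(),               # 5: number of implicit H's
        atom.GetNumRadicalElectrons(),         # 6: radical electrons
        int(atom.GetHybridization()),          # 7: hybridisation
        int(atom.GetIsAromatic()),             # 8: aromatic flag
    ] + [int(ring.IsAtomInRingOfSize(atom.GetIdx(), size)) for size in range(3, 9)]  # 9-15

mol = Chem.MolFromMolFile("ligand.sdf")                  # 1. load ligand (placeholder path)
mol = Chem.RemoveHs(mol)                                 # 2. strip hydrogens
features = [atom_features(a) for a in mol.GetAtoms()]    # 3. one 16-dim vector per atom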

3. PoseBusters Benchmark & RDKit Version Bump 

Using the supplied evaluation.py script, which docks against whichever protein chains the ground-truth ligand is bound to, we evaluated the 428-complex PoseBusters set with two different RDKit versions: 

RDKit version   Success rate (RMSD < 2 Å) 
2022.03.3       50.89 % 
2025.03.1       23.72 % 

With no changes other than the RDKit version, the success rate dropped by over half. 

Having checked the evaluation and conformer-generation steps, I took a more detailed look at the preprocessed data being fed into the model using the different RDKit versions. Everything was identical except implicit valence: 
– RDKit 2022.03.3: implicit valence = 0 for every atom 
– RDKit 2025.03.1: implicit valence ranges from 0 to 3 

Relevant Changes to RDKit’s GetImplicitValence() 

Between 2022.03.3 and 2025.03.1, RDKit was refactored so that implicit hydrogen counts are recomputed rather than permanently zeroed out after stripping explicit H’s. 

Old 2022.03.3 behavior: 

  • RemoveHs() deletes all explicit hydrogens and sets each heavy atom’s internal flag df_noImplicit = true, keeping only a heavy atom representation. 
  • Once df_noImplicit is set, asking for implicit valence always returns 0, even if you re-run sanitization. 

New 2025.03.1 behavior: 

  • RemoveHs() deletes explicit hydrogens but does not flag df_noImplicit = true, allowing recomputation of implicit valence. 
  • Sanitization calculates implicit valence = allowed valence – sum of explicit bonds 
  • GetImplicitValence() returns the correct implicit valence, even after stripping all H’s. 

These changes mean: 
Old (2022.03.3): RemoveHs() → df_noImplicit → GetImplicitValence() always 0 
New (2025.03.1): RemoveHs() (flag untouched) → sanitization recomputes → GetImplicitValence() returns the correct implicit-H count 
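A quick way to see this is to print each atom’s implicit valence straight after stripping hydrogens; with RDKit 2022.03.3 every atom reports 0, while with 2025.03.1 the recomputed implicit-H counts appear (the file path below is a placeholder):

# Inspect implicit valences after RemoveHs(); the output depends on the RDKit version.
from rdkit import Chem

mol = Chem.MolFromMolFile("ligand.sdf", removeHs=False)  # placeholder path, keep H's for now
mol = Chem.RemoveHs(mol)                                 # strip hydrogens, heavy atoms only

for atom in mol.GetAtoms():
    print(atom.GetSymbol(), atom.GetImplicitValence())   # all zeros on 2022.03.3, 0-3 on 2025.03.1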
 
Because DiffDock was only ever trained on zeros at that index, suddenly inputting non-zero values at inference caused this collapse in performance. 

We force-zeroed that column and recovered performance under the new RDKit, validating that this caused the drop in performance: 

- implicit_valence = atom.GetImplicitValence() 
+ implicit_valence = 0 

RDKit build            Success rate 
2022.03.3 baseline     50.89 % 
2025.03.1 unpatched    23.72 % 
2025.03.1 patched      50.26 % 

4. Why Zero-Trained → Non-Zero-Tested Is So Bad  

The weight, w, controls how much “implicit valence” influences the network. There’s also a built-in bias b and an activation function ϕ. Together they compute: 
 
    output = ϕ(w·v + b) 

where v is the implicit valence feature. 

What Happens When You Train on Only Zeros? 

  • Implicit valence (v) = 0 every time you train. 
  • Since the input is always zero, there’s no signal telling w to move. In the absence of an explicit mechanism for the weights to become zero, such as weight decay, they will remain non-zero. 
  • Effectively, the model learns that the implicit valence feature column doesn’t matter, and w remains at the random starting point. 

What happens at test time? 

  • The implicit valence feature (v) might be 1, 2, or 3 now. 
  • The unchanged, random w multiplies this new v, producing unpredictable activations ϕ(w_random·v + b). 
  • These activations continue through downstream layers to the final prediction output. 
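As a toy illustration (not DiffDock’s actual architecture), consider a single unit ϕ(w·v + b) trained only on v = 0: the gradient with respect to w is always zero, so w stays at its random initialisation, and any non-zero v at test time pushes the activation somewhere arbitrary:

# Toy illustration (not DiffDock's architecture): a single unit phi(w*v + b).
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(), 0.1          # w starts random and never gets trained
phi = np.tanh                     # an arbitrary activation function

# Training: v is always 0, so d(loss)/dw = d(loss)/d(output) * phi'(w*v + b) * v = 0,
# meaning w receives no gradient and keeps its random value.
print("train-time output:", phi(w * 0.0 + b))         # depends only on b

# Test time: v is suddenly 1, 2 or 3 and the untrained, random w kicks in.
for v_test in (1.0, 2.0, 3.0):
    print(f"v = {v_test}: output = {phi(w * v_test + b):.3f}")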

5. Conclusion 

Featurisation is very important – in the case of DiffDock, one library tweak changed one feature column and halved the performance! The fix was easy once it was found, but remember: 

  1. Featurisation is key 
  2. Particularly in the case of DiffDock, use the listed dependency versions! 
  3. If you see a sudden large change in performance, it might be worth checking the package versions and the features… 

Happy docking! 

Slurm and Snakemake: a match made in HPC heaven

Snakemake is an incredibly useful workflow management tool that allows you to run pipelines in an automated way. Simply put, it allows you to define inputs and outputs for different steps that depend on each other; Snakemake will then run jobs only when the required inputs have been generated by previous steps. A previous blog post by Tobias is a good introduction to it – https://www.blopig.com/blog/2021/12/snakemake-better-workflows-with-your-code/

However, pipelines are often computationally intensive and we would like to run them on our HPC. Snakemake allows us to do this on Slurm using an extension package called snakemake-executor-plugin-slurm, as sketched below.
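As a minimal sketch (rule names, file paths and the account/partition are placeholders, and exact flags may differ between Snakemake versions), a Snakefile defines the steps and their inputs/outputs:

rule all:
    input:
        "results/counts.txt"

rule count_lines:
    input:
        "data/input.txt"
    output:
        "results/counts.txt"
    resources:
        mem_mb=2000,
        runtime=10
    shell:
        "wc -l {input} > {output}"

With Snakemake 8+ and the plugin installed, a command along the lines of the following should submit each job through Slurm instead of running it locally:

snakemake --executor slurm --jobs 10 --default-resources slurm_account=<account> slurm_partition=<partition>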

Continue reading

A Masterclass in Basic & Translational Immunology with Prof. Abul Abbas

On Thursday 17th April, a group of us made the journey ‘up the hill’ to the Richard Doll building to attend an immunology masterclass from Professor Abul Abbas. Prof. Abbas is an emeritus professor in Pathology at UCSF and author of numerous core textbooks including Basic Immunology: Functions and Disorders of the Immune System.

The whole-day course consisted of a series of lectures covering core topics in immunology, from innate immunity and antigen presentation through to B/T cell subsets, autoimmunity, and immunotherapy.

Continue reading

AI generated linkers™: a tutorial

In molecular biology, cutting and tweaking a protein construct is an often under-appreciated but essential operation. Some proteins have unwanted extra bits. Some proteins may require a partner to be in the correct state, which would ideally be expressed as a fusion protein. Some proteins need parts replacing. Some proteins disfavour a desired state. Half a decade ago, toolkits existed to attempt to tackle these problems, and now, with the advent of de novo protein generation, new, powerful, precise and far less painful methods are here. Therefore, herein I will discuss how to generate de novo inserts and more with RFdiffusion and other tools in order to quickly launch a project into the right orbit.
Furthermore, even when newer methods come out, these design principles will still apply, so ignore the name of the de novo tool used.

Continue reading

NVIDIA Reimagines CUDA for Python Developers

According to GitHub’s Open Source Survey, Python officially became the world’s most popular programming language in 2024, ultimately surpassing JavaScript. Due to its exceptional popularity, NVIDIA announced Python support for its CUDA toolkit at last year’s GTC conference, marking a major leap in the accessibility of GPU computing. With the latest update (https://nvidia.github.io/cuda-python/latest/), developers can for the first time write Python code that runs directly on NVIDIA GPUs without the need for intermediate C or C++ code.

Historically tied to C and C++, CUDA has found its way into Python code through third-party wrappers and libraries. Now, the arrival of native support means a smoother, more intuitive experience.

This paradigm shift opens the door for millions of Python programmers – including our scientific community – to build powerful AI and scientific tools without having to switch languages or learn legacy syntax.

Continue reading

Debugging code for science: Fantastic Bugs and Where to Find Them.

The simulation results make no sense … My proteins are moving through walls and this dihedral angle is negative; my neural network won’t learn anything, I’ve tried for days to install this software and I still get an error.

Feel familiar? Welcome to scientific programming. Bugs aren’t just annoying roadblocks – they’re mysterious phenomena that make you question your understanding of reality itself. If you’ve ever found yourself debugging scientific code, you know it’s a different beast compared to traditional software engineering. In the commercial software world, a bug might mean a button doesn’t work or data isn’t saved correctly. In scientific computing, a bug might mean your climate model predicts an ice age next Tuesday, or your protein folding algorithm creates molecular structures that couldn’t possibly exist in our universe (cough).

Continue reading

GUI Science

There comes a point in every software-inclined, lab-based grad student’s life when they think: now is the time to write a GUI for my software, to make it fast, easy to use, generalised so that others will use it too, the new paradigm for how to do research, etc. etc.

Of course, such delusions of grandeur are rarely indulged, but when executed they certainly can produce useful outputs. A well designed (or even, designed) GUI can improve an experimentalist’s life profoundly by simplifying, automating and standardising data acquisition, and by reducing the time to see results, allowing for shorter iteration cycles (known in engineering as the “Design, Build, Test, Learn” cycle; in software it’s called “Coding”).

Having written a few GUIs in my time, I thought it might be helpful to share some experience I have, though it is by no means broad.

Continue reading

LOADING: an art and science collaborative project

For the past few months, OPIGlets Gemma, Charlie and Alexi have been engaged in a collaboration between scientists from Oxford and artists connected to Central St Martins art college in London. This culminated in February with the publication of a zine detailing our work, and a final symposium where we presented our projects to the wider community.

This collaboration was led by organisers Barney Hill and Nina Gonzalez-Park and comprised a series of workshops in various locations across Oxford and London, where the focus was to discuss commonalities between contemporary artistic and scientific research and the concept of transdisciplinary work. Additionally, scientists and artists were paired up to explore shared interests, with the goal of creating a final piece to exhibit.

Continue reading

Holding out for a GPU

There comes a time in every OPIGlet’s life – all too often, if you ask me – when one is asked to write a blog post. Oh, what frightful fate to befall one! What topic to select? To what end? Something surprising? Something insightful! Something useful, at least.

Not today. Today, dear reader, we address instead the singularly most enduring, deeply seated, fervently felt yearnings of the computational scientist – a fervour woven from every fibre of their being, a longing so forceful it rends their very heart, a passion that burns with the fire of a thousand suns.

Continue reading