Category Archives: Uncategorized

Pyrosetta for RFdiffusion

I will not lie: I often struggle to find a snippet of code that did something in PyRosetta, or I spend hours facing a problem caused by something not working as I expected. I recently did a tricky project involving RFdiffusion and I kept slipping on the PyRosetta side. So to make future me, others, and ChatGPT5 happy, here are some common operations to make working with PyRosetta for RFdiffusion easier.

Continue reading →

Tracking the change in ML performance for popular small molecule benchmarks

The power of machine learning (ML) techniques has captivated the field of small molecule drug discovery. Increasingly, researchers and organisations have employed ML to create more accurate algorithms to improve the efficiency of the discovery process.

To be published, methods have to prove they have improved upon others. Often, methods are tested against the same benchmarks within a field, allowing us to track progress over time. To explore the rate of improvement, I curated the performance on three popular benchmarks. The first benchmark is CASF 2016, used to test the accuracy of methods that predict the binding affinity of experimentally determined protein-ligand complexes. Accuracy was measured using the Pearson’s R value between predicted and experimental affinity values.
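As a minimal sketch of the metric mentioned above, Pearson’s R between predicted and experimental affinities can be computed in a few lines of plain Python. The function name `pearson_r` and the affinity values are made up for illustration; they are not taken from CASF 2016 or from any published method.

```python
import math


def pearson_r(predicted, experimental):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(experimental) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    # Sums of squared deviations (proportional to the variances)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("Pearson's R is undefined for constant input")
    return cov / math.sqrt(var_p * var_e)


# Hypothetical affinities (e.g. pK values), invented for this example only
predicted_pk = [6.1, 7.4, 5.2, 8.0, 6.8]
experimental_pk = [5.9, 7.9, 4.8, 8.3, 6.1]

r = pearson_r(predicted_pk, experimental_pk)
print(f"Pearson's R = {r:.3f}")
```

A value close to 1 means the predictions rank and scale with the experimental values; it says nothing about systematic offsets, which is one reason benchmarks often report an error measure alongside it.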

Continue reading →

Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools when working in Python! Its data frames bring a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.
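To give a flavour of the idea, here is a minimal sketch (not the code from the full article) that reads the fixed-width ATOM/HETATM records of a PDB file into a data frame. The helper name `pdb_lines_to_dataframe`, the column names, and the three coordinate lines are assumptions made for illustration.

```python
import pandas as pd


def pdb_lines_to_dataframe(lines):
    """Parse ATOM/HETATM records (fixed-width PDB columns) into a DataFrame."""
    rows = []
    for line in lines:
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue  # skip HEADER, REMARK, TER, etc.
        rows.append(
            {
                "record": record,
                "serial": int(line[6:11]),
                "atom_name": line[12:16].strip(),
                "res_name": line[17:20].strip(),
                "chain": line[21],
                "res_seq": int(line[22:26]),
                "x": float(line[30:38]),
                "y": float(line[38:46]),
                "z": float(line[46:54]),
            }
        )
    return pd.DataFrame(rows)


# Invented example records; a real file would be read with open(path)
example_lines = [
    "HEADER    EXAMPLE STRUCTURE",
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
    "HETATM    3  O   HOH A 101      10.000  11.500  12.250",
]

df = pdb_lines_to_dataframe(example_lines)
alpha_carbons = df[df["atom_name"] == "CA"]   # select C-alpha atoms
atoms_per_residue = df.groupby("res_name").size()
print(df)
```

Once the atoms are in a data frame, selections such as "all C-alpha atoms of chain A" become one-line boolean filters rather than hand-written loops.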

Continue reading →

The workings of Fragmenstein’s RDKit neighbour-aware minimisation

Fragmenstein is a Python module that combines hits or positions a derivative following given templates by being very strict in obeying them. This is done by creating a “monster”, a compound that has the atomic positions of the templates, which is then reanimated by very strict energy minimisation. This is done in two steps, first in RDKit with an extracted frozen neighbourhood and then in PyRosetta within a flexible protein. The mapping for both combinations and placements is complicated, but I will focus here on a particular step, the minimisation, primarily in answer to an enquiry, namely how does the RDKit minimisation work.

Continue reading →

Le Tour de Farce 2023

16:30 BST 27/06/2023 Oxford, UK. A large number of scientists were spotted riding bicycles across town, to the consternation of onlookers. The event was the Oxford Protein Informatics Group (OPIG) “tour de farce” 2023, a circular bike ride from the Department of Statistics to The Up in Arms (Marston), The Trout Inn (Godstow), The Perch (Port Meadow) and The Holly Bush (Osney Island). This spurred great bystander-anxiety due to one of a multitude of factors: the impressive size of the jovial horde, the erraticism of the cycling, the deplorable maintenance of certain bikes, and the unchained bizarrerie of the overheard dialogue.

Dissociated Press.

Continue reading →

KAUST Computational Advances in Structural Biology

Last month, I had the privilege of being invited to the KAUST Research Conference on Computational Advances in Structural Biology, held from May 1-3, 2023. This gave me the opportunity to present some of the latest OPIG work on small molecules while visiting an exceptional campus with state-of-the-art facilities in one of those corners of the world that are not widely known. Beyond the impressive surroundings, I had the chance to attend a highly engaging conference and meet many scientists from different backgrounds.

KAUST Library (left) and Dining Hall (right)

The conference brought together experts in the field to explore cutting-edge developments in computational structural biology. It had a primary focus on advancements in protein structure prediction, multi-scale simulations, and integrative structural biology. Cryo-electron microscopy (cryo-EM) was the most popular experimental technique, with more than a third of the talks dedicated to its applications. These talks showcased impressive examples where structure prediction, simulations, and mid-resolution cryo-EM maps were combined to construct atomic models of large macromolecular complexes.

Notable examples of integrative works were presented by Jan Kosinski and Thomas Miller, among others. Jan Kosinski shared insights into the model of the human nuclear pore complex, highlighting the integration of cryo-electron tomography (cryo-ET), prior experimental knowledge, and AlphaFold predictions. Thomas Miller, on the other hand, presented his work on EM-based visual biochemistry, which combines single-particle cryo-EM and time-resolved experiments as a tool to study the molecular mechanisms of eukaryotic DNA replication.

There were also several talks about novel algorithms. Nazim Bouatta presented some lesser-known details about OpenFold and introduced some of their approaches to tackling the problem of multimer modelling. He also announced the future release of folding methods for predicting protein-ligand complexes. Jianlin Cheng presented MULTICOM, their new protein structure predictor based on consensus predictions from AlphaFold. Sergei Grudinin showed deep-learning tools able to predict protein dynamics as well as some integrative modelling tools driven by low-resolution experimental observations, such as small-angle scattering.

On the cryo-EM methods side, Mikhail Kudryashev presented TomoBEAR and SUSAN, cryo-EM tools developed to automate the analysis of tomographic data. Johannes Schwab presented DynaMight, a deep learning-based approach for heterogeneity analysis in single-particle cryo-EM. On the computational chemistry side, Haribabu Arthanari showed their ultra-large virtual screening platform and Jean-Louis Reymond talked about tools to enumerate, visualize and search the vast chemical space of drug-like molecules.

Overall, the conference provided a diverse set of talks that facilitated multidisciplinary views and discussions. From protein structure prediction to integrative approaches combining experimental and computational methods, the talks showed the transformative potential of computational analysis in unravelling the complexities of biological macromolecules.

Better histograms with Python

Histograms are frequently used to visualize the distribution of a data set or to compare between multiple distributions. Python, via matplotlib.pyplot, contains convenient functions for plotting histograms; the default plots it generates, however, leave much to be desired in terms of visual appeal and clarity.

The two code blocks below generate histograms of two normally distributed data sets. The first uses the default matplotlib.pyplot.hist settings; in the second, I add some lines to improve the data presentation. See the comments to find out what each individual line is doing.

Continue reading →

The ultimate modulefile for conda

Environment modules is a great tool for high-performance computing: it is a modular system for quickly and painlessly enabling preset configurations of environment variables. For example, a user may be provided with a modulefile for an antiquated version of a tool and another for a bleeding-edge alpha version of that same tool, and they can easily load whichever they wish. In many clusters the modules are created with a tool called EasyBuild, which delivers an out-of-the-box installation. This works for things like a single binary, but for conda it falls severely short, as many configuration changes are needed.

Continue reading →

On The Logic of GOing with Weisfeiler-Lehman

Recently, I was able to attend Martin Grohe’s talk on The Logic of Graph Neural Networks. Professor Grohe of RWTH Aachen University is a titan of the fields of logic and complexity theory. Even so, he is modest about his achievements, and I was tickled when it was pointed out to me that the theorem he refers to as “a little complex”, one of his crowning achievements, involves a four-hundred-page-long book of a proof.

The theorem relates to the Weisfeiler-Lehman (WL) algorithm, an algorithm for testing whether two graphs are equivalent (i.e. isomorphic). The algorithm has deep connections with combinatorics, complexity theory and first order logic, a system of logic that is remarkably similar to the relations present in ontologies such as the Gene Ontology (GO), which is commonly used to compare and predict protein function. Kernelised methods and other WL-based metrics present a new and possibly logically “complete” way to compare the functions of proteins and infer their similarity.
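For readers who have not met it, the simplest (1-dimensional) form of the WL test, often called colour refinement, can be sketched as follows. This is an illustrative toy and not code from the talk: the function names and the adjacency-dictionary graph format are assumptions. Each node repeatedly receives a new colour built from its own colour and the sorted colours of its neighbours, and two graphs are compared by their final colour histograms.

```python
from collections import Counter


def wl_colour_histogram(adjacency, rounds=3):
    """Run 1-WL colour refinement and return the histogram of node colours.

    adjacency: dict mapping each node to a list of its neighbours.
    Colours are nested tuples, so they are comparable across graphs
    without needing a shared lookup table.
    """
    colours = {node: () for node in adjacency}  # every node starts identical
    for _ in range(rounds):
        colours = {
            node: (colours[node], tuple(sorted(colours[nb] for nb in adjacency[node])))
            for node in adjacency
        }
    return Counter(colours.values())


def wl_may_be_isomorphic(graph_a, graph_b, rounds=3):
    """False means certainly not isomorphic; True means the test cannot tell them apart."""
    return wl_colour_histogram(graph_a, rounds) == wl_colour_histogram(graph_b, rounds)


# A path and a star on four nodes: same edge count, different structure
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

# A six-node cycle versus two disjoint triangles: a classic blind spot of 1-WL
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_may_be_isomorphic(path, star))              # False
print(wl_may_be_isomorphic(hexagon, two_triangles))  # True, although not isomorphic
```

The second pair shows why this basic form is only a one-sided test: every node in both graphs has exactly two neighbours, so the colours never diverge even though the graphs are different.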

The Gene Ontology follows a simple set of rules, very similar to first order logic. From the GO Database Description

Continue reading →

COSTNET19 Conference

Last month, I attended the COSTNET19 Conference in Bilbao (Spain). This conference is organised by COSTNET, a COST Action which aims to foster international European collaboration on the emerging field of statistics of network data science. COSTNET facilitates interaction and collaboration between diverse groups of statistical network modellers, establishing a large and vibrant interconnected and inclusive community of network scientists.

Continue reading →