
Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools for working in Python! Its data frames bring a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.
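As a rough illustration of the idea (not the code from the full post), the sketch below slices the fixed-width ATOM/HETATM records of a PDB file into a data frame with one row per atom. The names `ATOM_FIELDS`, `pdb_to_dataframe` and `toy_atom_line` are hypothetical, and the four atoms are made-up toy data generated in place so that the column alignment is exact.

```python
import pandas as pd

# 0-based, half-open column slices for ATOM/HETATM records (PDB fixed-width format).
ATOM_FIELDS = {
    "record": (0, 6),
    "serial": (6, 11),
    "atom_name": (12, 16),
    "res_name": (17, 20),
    "chain": (21, 22),
    "res_seq": (22, 26),
    "x": (30, 38),
    "y": (38, 46),
    "z": (46, 54),
    "b_factor": (60, 66),
    "element": (76, 78),
}
NUMERIC_FIELDS = ["serial", "res_seq", "x", "y", "z", "b_factor"]


def pdb_to_dataframe(pdb_text):
    """Parse the ATOM/HETATM lines of a PDB-format string into one row per atom."""
    rows = [
        {name: line[start:end].strip() for name, (start, end) in ATOM_FIELDS.items()}
        for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]
    df = pd.DataFrame(rows, columns=list(ATOM_FIELDS))
    for name in NUMERIC_FIELDS:
        df[name] = pd.to_numeric(df[name])
    return df


def toy_atom_line(serial, atom_name, res_name, chain, res_seq, x, y, z, b_factor, element):
    """Format one made-up ATOM record so that every field sits in its PDB columns."""
    return (
        f"{'ATOM':<6}{serial:>5} {atom_name:<4} {res_name:>3} {chain}{res_seq:>4}"
        + " " * 4  # insertion code (column 27) and unused columns 28-30
        + f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b_factor:6.2f}"
        + " " * 10  # columns 67-76 are left blank in this toy example
        + f"{element:>2}"
    )


# Toy data only: coordinates and B-factors are invented for illustration.
TOY_ATOMS = [
    (1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504, 20.0, "N"),
    (2, "CA", "ALA", "A", 1, 11.639, 6.071, -5.147, 22.0, "C"),
    (3, "N", "GLY", "A", 2, 12.500, 7.200, -4.100, 30.0, "N"),
    (4, "CA", "GLY", "A", 2, 13.100, 8.300, -3.300, 34.0, "C"),
]
PDB_TEXT = "\n".join(toy_atom_line(*atom) for atom in TOY_ATOMS)

df = pdb_to_dataframe(PDB_TEXT)

# Typical follow-up queries: select the C-alpha atoms, average B-factor per residue.
ca_atoms = df[df["atom_name"] == "CA"]
mean_b = df.groupby(["chain", "res_seq", "res_name"])["b_factor"].mean()
print(ca_atoms[["res_name", "res_seq", "x", "y", "z"]])
print(mean_b)
```

For a real structure, the same function can be given the contents of a PDB file read as text; once the atoms are in a data frame, filtering and grouping work as with any other table.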


The workings of Fragmenstein’s RDKit neighbour-aware minimisation

Fragmenstein is a Python module that combines hits or positions a derivative following given templates, obeying them very strictly. It does this by creating a “monster”, a compound that has the atomic positions of the templates, which is then reanimated by very strict energy minimisation. The minimisation is done in two steps, first in RDKit with an extracted frozen neighbourhood and then in PyRosetta within a flexible protein. The mapping for both combinations and placements is complicated, but I will focus here on one particular step, the minimisation, primarily in answer to an enquiry: how does the RDKit minimisation work?


Le Tour de Farce 2023

16:30 BST 27/06/2023 Oxford, UK. A large number of scientists were spotted riding bicycles across town, to the consternation of onlookers. The event was the Oxford Protein Informatics Group (OPIG) “tour de farce” 2023, a circular bike ride from the Department of Statistics to The Up in Arms (Marston), The Trout Inn (Godstow), The Perch (Port Meadow) and The Holly Bush (Osney Island). This spurred great bystander-anxiety due to one of a multitude of factors: the impressive size of the jovial horde, the erraticism of the cycling, the deplorable maintenance of certain bikes, and the unchained bizarrerie of the overheard dialogue.

Dissociated Press.

KAUST Computational Advances in Structural Biology

Last month, I had the privilege of being invited to the KAUST Research Conference on Computational Advances in Structural Biology, held from May 1-3, 2023. This gave me the opportunity to present some of the latest OPIG work on small molecules while visiting an exceptional campus with state-of-the-art facilities in one of those corners of the world that are not widely known. Moreover, the experience went beyond the impressive surroundings, as I had the chance to attend a highly engaging conference and meet many scientists from different backgrounds.

KAUST Library (left) and Dining Hall (right)

The conference brought together experts in the field to explore cutting-edge developments in computational structural biology. It had a primary focus on advancements in protein structure prediction, multi-scale simulations, and integrative structural biology. Cryo-electron microscopy (cryo-EM) was the most popular experimental technique, with more than a third of the talks dedicated to its applications. These talks showcased impressive examples where structure prediction, simulations, and mid-resolution cryo-EM maps were combined to construct atomic models of large macromolecular complexes.

Notable examples of integrative work were presented by Jan Kosinski and Thomas Miller, among others. Jan Kosinski shared insights into the model of the human nuclear pore complex, highlighting the integration of cryo-electron tomography (cryo-ET), prior experimental knowledge, and AlphaFold predictions. Thomas Miller, on the other hand, presented his work on EM-based visual biochemistry, which combines single-particle cryo-EM and time-resolved experiments as a tool to study the molecular mechanisms of eukaryotic DNA replication.

There were also several talks about novel algorithms. Nazim Bouatta presented some lesser-known details about OpenFold and introduced some of their approaches to tackling the problem of multimer modelling. He also announced the future release of folding methods for predicting protein-ligand complexes. Jianlin Cheng presented MULTICOM, their new protein structure predictor based on consensus predictions from AlphaFold. Sergei Grudinin showed deep-learning tools able to predict protein dynamics, as well as some integrative modelling tools driven by low-resolution experimental observations, such as small-angle scattering.

On the cryo-EM methods side, Mikhail Kudryashev presented TomoBEAR and SUSAN, cryo-EM tools developed to automate the analysis of tomographic data. Johannes Schwab presented DynaMight, a deep learning-based approach for heterogeneity analysis in single-particle cryo-EM. On the computational chemistry side, Haribabu Arthanari showed their ultra-large virtual screening platform and Jean-Louis Reymond talked about tools to enumerate, visualize and search the vast chemical space of drug-like molecules.

Overall, the conference provided a diverse set of talks that facilitated multidisciplinary views and discussions. From protein structure prediction to integrative approaches combining experimental and computational methods, the talks showed the transformative potential of computational analysis in unravelling the complexities of biological macromolecules.

Better histograms with Python

Histograms are frequently used to visualize the distribution of a data set or to compare multiple distributions. Python, via matplotlib.pyplot, provides convenient functions for plotting histograms; the default plots it generates, however, leave much to be desired in terms of visual appeal and clarity.

The two code blocks below generate histograms of two normally distributed data sets. The first uses the default matplotlib.pyplot.hist settings; in the second, I add some lines to improve the data presentation. See the comments to determine what each individual line is doing.
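Separately from the post's own two blocks, here is a minimal sketch of the same contrast in a single figure: default settings on the left, a few adjustments on the right. The specific adjustments (shared bin edges, transparency, density normalisation, bar outlines, labels and a legend) are my assumptions about what "improved" could mean, not necessarily the choices made in the full post, and the sample sizes, means and seed are arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Two normally distributed toy data sets (arbitrary parameters, fixed seed).
rng = np.random.default_rng(0)
set_a = rng.normal(loc=0.0, scale=1.0, size=1000)
set_b = rng.normal(loc=1.5, scale=1.0, size=1000)

fig, (ax_default, ax_better) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: defaults. Each call picks its own 10 bins and the bars are opaque,
# so the second histogram hides part of the first.
ax_default.hist(set_a)
ax_default.hist(set_b)
ax_default.set_title("Default settings")

# Right panel: the same data with a few adjustments.
# Shared bin edges make the two distributions directly comparable.
low = min(set_a.min(), set_b.min())
high = max(set_a.max(), set_b.max())
bins = np.linspace(low, high, 31)  # 31 edges give 30 equal-width bins

# alpha lets overlapping bars show through, density=True normalises each
# histogram to unit area, and edgecolor outlines the individual bars.
counts_a, edges_a, _ = ax_better.hist(
    set_a, bins=bins, alpha=0.5, density=True, edgecolor="black", label="Set A"
)
counts_b, edges_b, _ = ax_better.hist(
    set_b, bins=bins, alpha=0.5, density=True, edgecolor="black", label="Set B"
)

# Axis labels, a title and a legend say what is being plotted.
ax_better.set_xlabel("Value")
ax_better.set_ylabel("Probability density")
ax_better.set_title("Shared bins, transparency, labels")
ax_better.legend(frameon=False)

fig.tight_layout()
```

To look at the result, call `fig.savefig("histograms.png", dpi=150)`, or drop the `matplotlib.use("Agg")` line and call `plt.show()` in an interactive session.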


The ultimate modulefile for conda

Environment modules is a great tool for high-performance computing: it is a modular system to quickly and painlessly enable preset configurations of environment variables. For example, a user may be provided with a modulefile for an antiquated version of a tool and another for a bleeding-edge alpha version of that same tool, and they can easily load whichever they wish. In many clusters the modules are created with a tool called EasyBuild, which delivers an out-of-the-box installation. This works for things like a single binary, but for conda it falls severely short, as many configuration changes are needed.


On The Logic of GOing with Weisfeiler-Lehman

Recently, I was able to attend Martin Grohe’s talk on The Logic of Graph Neural Networks. Professor Grohe, of RWTH Aachen University, is a titan of the fields of logic and complexity theory. Even so, he is modest about his achievements, and I was tickled when it was pointed out to me that the theorem he refers to as “a little complex”, one of his crowning achievements, involves a four-hundred-page book of a proof.

The theorem relates to the Weisfeiler-Lehman (WL) algorithm, an algorithm for testing whether two graphs are equivalent (i.e. isomorphic). The algorithm has deep connections with combinatorics, complexity theory and first-order logic, a system of logic that is remarkably similar to the relations present in ontologies such as the Gene Ontology (GO), which is commonly used to compare and predict protein function. Kernelised methods and other WL-based metrics present a new and possibly logically “complete” way to compare the functions of proteins and infer their similarity.
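To make the algorithm concrete, here is a minimal sketch of its simplest variant, one-dimensional WL or "colour refinement", assuming undirected graphs given as adjacency dictionaries. The function name `wl_indistinguishable` and the example graphs are mine, for illustration only. Every node starts with the same colour; in each round a node's new colour is determined by its old colour together with the multiset of its neighbours' colours, and the two graphs are told apart as soon as their colour histograms differ. The test is one-sided: differing histograms prove the graphs are not isomorphic, but identical histograms do not prove that they are.

```python
from collections import Counter


def wl_indistinguishable(graph_1, graph_2, rounds=None):
    """Run 1-WL colour refinement on two graphs at once.

    Each graph is a dict mapping a node to an iterable of its neighbours.
    Returns False if the refinement tells the graphs apart (so they are
    certainly not isomorphic) and True if it cannot distinguish them.
    """
    if len(graph_1) != len(graph_2):
        return False
    graphs = (graph_1, graph_2)
    # Every node starts with the same colour.
    colours = [{node: 0 for node in graph} for graph in graphs]
    # The colouring is stable after at most n rounds on n nodes.
    n_rounds = len(graph_1) if rounds is None else rounds
    for _ in range(n_rounds):
        palette = {}  # signature -> compact colour id, shared by both graphs
        refined = []
        for graph, colour in zip(graphs, colours):
            new_colour = {}
            for node, neighbours in graph.items():
                # A node's signature is its own colour plus the sorted
                # multiset of its neighbours' colours.
                signature = (colour[node], tuple(sorted(colour[n] for n in neighbours)))
                new_colour[node] = palette.setdefault(signature, len(palette))
            refined.append(new_colour)
        colours = refined
        if Counter(colours[0].values()) != Counter(colours[1].values()):
            return False
    return True


# A path and a star on four nodes: same number of edges, different structure.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

# A six-cycle versus two disjoint triangles: every node has degree 2 in both.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

print(wl_indistinguishable(path, star))              # False
print(wl_indistinguishable(hexagon, two_triangles))  # True
```

The second pair shows the limitation: the hexagon is connected and the two triangles are not, yet colour refinement gives every node the same colour in both graphs, so it cannot separate them.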

The Gene Ontology follows a simple set of rules, very similar to first-order logic. From the GO Database Description

COSTNET19 Conference

Last month, I attended the COSTNET19 Conference in Bilbao (Spain). This conference is organised by COSTNET, a COST Action which aims to foster international European collaboration on the emerging field of statistics of network data science. COSTNET facilitates interaction and collaboration between diverse groups of statistical network modellers, establishing a large and vibrant interconnected and inclusive community of network scientists.


Why you should care about startups as a researcher

I was recently awarded the EIT Health Translational Fellowship, which aims to fund DPhil projects with the goal of commercializing the research and addressing the funding gap between research and seed funding. In order to win, I had to deliver a short 5-minute startup pitch in front of a panel of investors and scientific experts to convince them that my DPhil project has impact as well as commercial viability. Besides the £5000 prize, the fellowship included a week-long training course on how to improve your pitch, address pain points in your business strategy, etc. I found the whole experience to be incredibly rewarding and the skills I picked up very important, even as a researcher. As a summary, this is why I think you should care about the startup world as a researcher.
