How to Blopig

Blopig has a wealth of knowledge, with everything from a Bayesian answer to the question “should ketchup be stored in the fridge?” to the Nobel-Prize-Winner-approved analysis of AlphaFold2. Blopig runs on WordPress and uses blocks: components for adding different types of content to a post, such as paragraphs, headers, images, image galleries, and videos. Here are some hints and tips for getting the most out of WordPress.

One of the first blocks worth mentioning is the “Read More…” block…

Continue reading →

How to turn a SMILES string into a molecular graph for PyTorch Geometric

Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors.

But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious to beginners how best to transform a SMILES string into a structured molecular graph object that can be used as input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of PyTorch Geometric, which is currently one of the best and most commonly used Python-based GNN libraries.

We divide our task into three high-level steps:

1. We define a function that maps an RDKit atom object to a suitable atom feature vector.
2. We define a function that maps an RDKit bond object to a suitable bond feature vector.
3. We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from steps 1 and 2 to create a list of labeled PyTorch Geometric graph objects as its output.
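To give a feel for the data layout these steps produce, here is a minimal, dependency-free sketch. It is not the code from the post: the hand-written atom and bond lists stand in for what RDKit would parse out of a SMILES string, the names `PERMITTED_ATOMS`, `one_hot_encoding` and `build_graph` are hypothetical, and bond features are left out entirely. The one thing it does mirror is the convention of storing edges as two rows of source and target indices, with each undirected bond listed in both directions.

```python
# Dependency-free sketch: hand-written atoms and bonds stand in for what
# RDKit would extract from a SMILES string (ethanol, "CCO", heavy atoms only).

# Hypothetical and deliberately tiny; a real feature set would be much larger.
PERMITTED_ATOMS = ["C", "N", "O", "Unknown"]


def one_hot_encoding(value, permitted):
    """One-hot encode value; anything not in permitted maps to the last slot."""
    if value not in permitted:
        value = permitted[-1]
    return [int(value == item) for item in permitted]


def build_graph(atom_symbols, bonds):
    """Return (node feature matrix, edge index as [sources, targets])."""
    x = [one_hot_encoding(symbol, PERMITTED_ATOMS) for symbol in atom_symbols]
    sources, targets = [], []
    for i, j in bonds:
        # Store each undirected bond as two directed edges.
        sources += [i, j]
        targets += [j, i]
    return x, [sources, targets]


x, edge_index = build_graph(["C", "C", "O"], [(0, 1), (1, 2)])
print(x)           # [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

In a real pipeline the nested lists would be wrapped in tensors and bundled, together with the label, into one graph object per molecule.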

Continue reading →

NeurIPS 2021 Conference Feedback

Held annually in December, the Neural Information Processing Systems meetings aim to encourage researchers using machine learning techniques in their work – whether it be in economics, physics, or any number of fields – to get together to discuss their findings, hear from world-leading experts, and in many years past, ski. The virtual nature of this year’s conference had an enormously negative impact on attendees’ skiing experiences, but it nevertheless was a pleasure to attend – the machine learning in structural biology workshop, in particular, provided a useful overview of the hottest topics in the field, and of the methods that people are using to tackle them.

This year’s NeurIPS highlighted the growing interest in applying the newest Natural Language Processing (NLP) algorithms to proteins. This includes antibodies, as seen in two presentations in the MLSB workshop, which focused on using these algorithms for the discovery and design of antibodies. Ruffolo et al. presented their version of a BERT-inspired language model for antibodies. The purpose of such a model is to create representations that encapsulate all the information in an antibody sequence, which can then be used to predict antibody properties. In their work, they showed how the representations could be used to predict high-redundancy sequences (a proxy for strong binders) and how continuous trajectories consistent with the number of mutations could be observed when applying UMAP to the representations. While such representations can be used to predict properties of antibodies, another work by Shuai et al. instead focused on training a generative language model for antibodies, able to generate a region of an antibody based on the rest of the antibody. This can then potentially be used to generate new viable CDR regions of variable length, which should work better than randomly mutating them.

Continue reading →

Python’s Data Classes

When writing code, you have inevitably needed to store data throughout your pipeline. In these cases you store your value, list or data frame as a variable so you can easily use it elsewhere in your code. However, sometimes your data has an awkward form, consisting of several lists of different lengths or data of different types and sizes. While such data is still workable, and tuples or dictionaries can help, accessing its different elements quickly becomes messy and it becomes less obvious what your code is actually doing.

To solve this problem, data classes were introduced as a new feature in Python 3.7. A data class is a regular Python class, but with certain methods already implemented for you. This makes them easy to create and removes a lot of boilerplate (repeated code), making them simpler, more intuitive and prettier. Further, as data classes are part of the standard library, you can import them directly without needing to install any external dependencies (noice).

With the sales pitch out of the way, let us look at how we can use data classes.

from dataclasses import dataclass
from typing import Any

@dataclass
class Antibody:
    vgene: str
    jgene: None
    sequence: Any = 'EVQ'
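
As a quick illustration of the "methods already implemented for you", here is a small sketch that repeats the class above and exercises the generated `__init__`, `__repr__` and `__eq__`. The gene name passed in is just an illustrative value, and `ab` is a hypothetical variable name. Note that the type hints are not enforced at runtime; they only document intent.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Antibody:
    vgene: str
    jgene: None
    sequence: Any = 'EVQ'


# __init__ is generated from the annotated fields; sequence falls back to its default.
ab = Antibody(vgene="IGHV3-23", jgene=None)

# __repr__ is generated too, so printing gives a readable summary.
print(ab)  # Antibody(vgene='IGHV3-23', jgene=None, sequence='EVQ')

# __eq__ compares instances field by field.
print(ab == Antibody("IGHV3-23", None))  # True
```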

Continue reading →

A quantitative way to measure targeted protein degradation

Whenever we order consumables in the Chemistry department, the whole lab gets an email notification once they arrive. So I can understand why I got some puzzled reactions from my colleagues when one such email arrived saying that my ‘artichoke’ was ready to collect from stores. Had I been sneakily doing my grocery shopping on a university research budget?

Artichoke is, in fact, the name of a plasmid designed by the Ebert lab (https://www.addgene.org/73320/), which I have been using in some of my research on targeted protein degradation. The premise is simple enough: the plasmid carries genes for two different fluorescent proteins, one of which is fused to a protein of interest.

Continue reading →

Tracking Changes in LaTeX

Tracking changes in Microsoft Word is easy – we just click the ‘Track Changes’ button. It requires a little more work in LaTeX but here is a quick guide to doing it as painlessly as possible!

First, we want our original document to be stored in a *.tex file in an Overleaf project (as shown below)

Continue reading →

What is a plantibody?

Plants can be genetically engineered to express non-native proteins; for example, crops can be engineered to produce insect toxins in order to improve their resistance to insect pests. However, I was not aware of their ability to express antibodies until, inspired by my expanding collection of house plants, I googled ‘plant immune systems’.

Plants don’t naturally produce antibodies – they do not possess an adaptive immune system or any circulating immune defence cells. Despite this, plants can be made to express and assemble full-length antibody heavy chains and light chains. This was first published back in 1989, when Hiatt et al. [1] successfully introduced mouse immunoglobulin genes to tobacco plants and produced functional antibodies with reasonable efficiency. The excellent term ‘plantibody’ was coined soon after, to refer to antibodies and fragments of antibodies produced by plants transformed with antibody-coding genes.

Continue reading →

Solving WORDLE with grep

People seem to have become obsessed with Wordle, just like they became obsessed with sudoku. After my initial burst of “oh a new game!” had waned, I was left thinking “my time is precious and this is exactly what we have computers for”. With this in mind, below is my quick and dirty way of solving these. I’m sure the regexp gurus amongst you will have a more elegant solution.

Step 1: Make sure you’ve got /usr/share/dict/words installed. This is just a huge list of words in a specific language and for me, this required installing the British words list.

sudo apt-get install wbritish

Step 2: Go to wordle

Step 3: Pick a random 5-letter word as your starting point. This is where grep and /usr/share/dict/words come in:
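
A rough Python equivalent of this step, offered as a sketch rather than the grep one-liner itself: keep only words made of exactly five lowercase letters, then pick one at random. The function names `five_letter_words` and `load_words` and the small fallback list are hypothetical, added so the snippet also runs on machines without a words file.

```python
import random
import re
from pathlib import Path

# Exactly five lowercase ASCII letters: no capitals, apostrophes or accents.
FIVE_LETTERS = re.compile(r"[a-z]{5}")


def five_letter_words(words):
    """Keep only entries made of exactly five lowercase letters."""
    stripped = (word.strip() for word in words)
    return [word for word in stripped if FIVE_LETTERS.fullmatch(word)]


def load_words(path="/usr/share/dict/words"):
    """Read the system word list, or fall back to a tiny sample list."""
    word_file = Path(path)
    if word_file.exists():
        return word_file.read_text(encoding="utf-8", errors="ignore").splitlines()
    return ["crane", "Paris", "it's", "slate", "grep", "words"]


candidates = five_letter_words(load_words())
print(random.choice(candidates))
```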

Continue reading →

Simplify your life with SLURM and sync

For my first blog post of the year, we’re talking about SLURM, everyone’s favorite job manager. If, like me, you have the joy of running a literal boat-load of jobs with all kinds of parameters and command-line arguments, you’ll know there are a few tips and tricks that make the process of managing these tasks and results as painless as possible. Now, I do expect most people reading this will already be aware of these tricks, but for those who aren’t, I hope this is helpful. After all, it’s impossible to know what you don’t know you need to know, you know? Any alternatives, improvements, or suggestions are welcome!

Array Jobs

Job arrays are perfect for the times you want to run the same job several times with slight differences each time. Imagine you need to repeat a job 10 times with slightly different arguments with each run. Rather than submit 10 (slightly different) batch scripts you can submit 1 script with all the information needed to complete all 10 jobs.
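
One common way to get those "slight differences" is to have each array task look up its own arguments from its task index. Here is a minimal sketch, assuming the job is submitted as an array of 10 tasks (for example with `sbatch --array=0-9`) and that the batch script calls this Python file. SLURM sets the `SLURM_ARRAY_TASK_ID` environment variable for each task; the `SEEDS` list and `params_for_task` function are hypothetical stand-ins for your own arguments.

```python
import os

# Hypothetical parameter list: one entry per array task (10 runs).
SEEDS = list(range(100, 110))


def params_for_task(task_id):
    """Look up the arguments belonging to a given array task."""
    return {"task_id": task_id, "seed": SEEDS[task_id]}


# SLURM sets SLURM_ARRAY_TASK_ID for each task of an array job;
# default to 0 so the script also runs outside SLURM.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
print(params_for_task(task_id))
```

Each of the 10 tasks then runs the same script but picks up a different seed.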

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
