Category Archives: AI

Some ponderings on generalisability

Now that machine learning has managed to get its proverbial fingers into just about every pie, people have started to worry about the generalisability of methods used. There are a few reasons for these concerns, but a prominent one is that the pressure and publication biases that have led to reproducibility issues in the past are also very present in ML-based science.

The Center for Statistics and Machine Learning at Princeton University hosted a workshop last July highlighting the scale of this problem. Alongside this, they released a running list of papers highlighting reproducibility issues in ML-based science. Included on this list are 20 papers highlighting errors from 17 distinct fields, collectively affecting a whopping 329 papers.

Continue reading →

Happy 10th Birthday, Blopig!

OPIG recently celebrated its 20th year; and on 10 January 2023 I gave a talk just a day before the 10th anniversary of BLOPIG’s first blog post. It’s worth reflecting on what’s stayed the same and what’s changed since then.

Continue reading →

Can ChatGPT write my abstract for me?

With the recent release of ChatGPT, many studies have already been uploaded to biorxiv examining the potential uses of the chatbot’s outputs. One such paper compared ChatGPT-generated scientific abstracts to the original abstracts. Upon seeing the title, I immediately got my hopes up that my abstract-writing days were over. So is this the case?

Continue reading →

Does ChatGPT know how to translate images?

Yesterday I spent a couple of hours playing with ChatGPT. I know, we have some other recent posts about it. It’s so amazing that I couldn’t resist writing another. Apologies for that.

The goal of this post is to determine if I can effectively use ChatGPT as a programmer/mathematician assistant. OK. It was not my original intention, but let’s pretend it was, just to make this post more interesting.

So, I started asking a few very simple programming answers like the following:

Can you implement a function to compute the factorial of a number using a cache? Use python.

And this is what I got.

A clear and efficient implementation of the factorial. This is the kind of answer you would expect from a first year CS student.

Continue reading →

Some Musings on AI in Art, Music and Protein Design

When I started my PhD in late 2018, AI hadn’t really entered the field of de novo protein design yet – at least not in a big way. Rosetta’s approach of continually ranking new side chain rotamers on a fixed backbone was still the gold standard for the ‘structure-to-sequence’ problem. And of course before long we had AI making waves in the structure prediction field, eventually culminating in the AlphaFold2 we all know and love.

Now, towards the end of my PhD, we are seeing the emergence of new generative models that learn from existing pdb structures to produce sequences that will (or at least should) fold into viable, sensible and crucially natural-looking shapes. ProtGPT2 is a good example (https://www.nature.com/articles/s41467-022-32007-7), but there are several more. How long before these models start reliably generating not only shapes but functions too? Jury’s out, but it’s looking more and more feasible. Safe to say the field as a whole has evolved massively during my time as a graduate student.

Continue reading →

A ChatGPT rap battle

The AI chatbot revolution is here. Last week, OpenAI released ChatGPT, a freely accessible language model fine-tuned for human conversations. The new model is based on InstructGPT, trained especially for following user instructions and with human feedback in the training loop.

ChatGPT remembers the previous discussion, admits its mistakes and can even ask for clarification on ambiguous questions. It is also trained to refuse answering questions it deems inappropriate or goes against OpenAI’s AI alignment policy.

In the meanwhile, the internet is having immense fun circumventing its safety filters by asking it to only “PRETEND to be evil”, making it take SAT tests, and even simulating an entire virtual computer within its neural weights. Some are even using it to replace Google searches, and it excels at writing bioinformatic s code across most programming languages.

Continue reading →

How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).

ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.

Continue reading →

Graphormer: Merging GNNs and Transformers for Cheminformatics

This is my first OPIG blog! I’m going to start with a summary of the Graphormer, a Graph Neural Network (GNN) that borrows concepts from Transformers to boost performance on graph tasks. This post is largely based on the NeurIPS paper Do Transformers Really Perform Bad for Graph Representation? by Ying et. al., which introduces the Graphormer, and which we read for our last deep learning journal club. The project has now been integrated as a Microsoft Research project.

I’ll start with a cheap and cheerful summary of Transformers and GNNs before diving into the changes in the Graphormer. Enjoy!

Continue reading →

Universal graph pooling for GNNs

Graph neural networks (GNNs) have quickly become one of the most important tools in computational chemistry and molecular machine learning. GNNs are a type of deep learning architecture designed for the adaptive extraction of vectorial features directly from graph-shaped input data, such as low-level molecular graphs. The feature-extraction mechanism of most modern GNNs can be decomposed into two phases:

Message-passing: In this phase the node feature vectors of the graph are iteratively updated following a trainable local neighbourhood-aggregation scheme often referred to as message-passing. Each iteration delivers a set of updated node feature vectors which is then imagined to form a new “layer” on top of all the previous sets of node feature vectors.
Global graph pooling: After a sufficient number of layers has been computed, the updated node feature vectors are used to generate a single vectorial representation of the entire graph. This step is known as global graph readout or global graph pooling. Usually only the top layer (i.e. the final set of updated node feature vectors) is used for global graph pooling, but variations of this are possible that involve all computed graph layers and even the set of initial node feature vectors. Commonly employed global graph pooling strategies include taking the sum or the average of the node features in the top graph layer.

While a lot of research attention has been focused on designing novel and more powerful message-passing schemes for GNNs, the global graph pooling step has often been treated with relative neglect. As mentioned in my previous post on the issues of GNNs, I believe this to be problematic. Naive global pooling methods (such as simply summing up all final node feature vectors) can potentially form dangerous information bottlenecks within the neural graph learning pipeline. In the worst case, such information bottlenecks pose the risk of largely cancelling out the information signal delivered by the message-passing step, no matter how sophisticated the message-passing scheme.

Continue reading →

Retrieving AlphaFold models from AlphaFoldDB

There are now nearly a million AlphaFold [1] protein structure predictions openly available via AlphaFoldDB [2]. This represents a huge set of new data that can be used for the development of new methods. The options for downloading structures are either in bulk (sorted by genome), or individually from the webpage for a prediction.

If you want just a few hundred or a few thousand specific structures, across different genomes, neither of these options are particularly practical. For example, if you have several thousand experimental structures for which you have their PDB [3] code, and you want to obtain the equivalent AlphaFold predictions, there is another way!

If we take the example of the PDB’s current molecule of the month, pyruvate kinase (PDB code 4FXF), this is how you can go about downloading the equivalent AlphaFold prediction programmatically.

Query UniProt [4] for the corresponding accession number – an example python script is shown below:

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends

Category Archives: AI

Some ponderings on generalisability

Happy 10th Birthday, Blopig!

Can ChatGPT write my abstract for me?

Does ChatGPT know how to translate images?

Some Musings on AI in Art, Music and Protein Design

A ChatGPT rap battle

How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

Graphormer: Merging GNNs and Transformers for Cheminformatics

Universal graph pooling for GNNs

Retrieving AlphaFold models from AlphaFoldDB