Monthly Archives: December 2022

Quality Stats

Disclaimer – the title is a Quality Street pun only and bears no relation to the quality of the data or analysis presented below. This whole blog post is basically to discredit the personal chocolate preferences of a group member who shall remain nameless. Safe to say though, they Vostly overestimated people’s love for the Toffee Finger. Long live the Orange Creme.

Continue reading →

The exotic zoo of antibodies

When I think of antibodies, I usually think of the standard human Y-shaped IgG. It is easy to forget that the world of antibodies is extremely diverse, both in the constant domain, with many different isotypes (e.g. IgA, IgD, IgE, and IgM), and in the variable domain (e.g. with or without a light chain, and with varying CDR lengths). This is before we even start looking at engineered antibodies, like the ones illustrated in a previous blog post by Alissa.

In this blog post, I want to highlight some of the exotic naturally occurring antibodies that might not have received much attention yet, but each of which has interesting features.

The standard antibody (e.g. human, mouse)

This is the standard antibody that we will compare the others with: a protein complex of two paired heavy and light chains forming the well-known Y shape. At each tip is a binding site that consists mainly of the three CDRs on each chain. Nice and simple.

Interesting facts:

Continue reading →

Does ChatGPT know how to translate images?

Yesterday I spent a couple of hours playing with ChatGPT. I know, we have some other recent posts about it. It’s so amazing that I couldn’t resist writing another. Apologies for that.

The goal of this post is to determine if I can effectively use ChatGPT as a programmer/mathematician assistant. OK. It was not my original intention, but let’s pretend it was, just to make this post more interesting.

So, I started by asking a few very simple programming questions like the following:

Can you implement a function to compute the factorial of a number using a cache? Use python.

And this is what I got.

A clear and efficient implementation of the factorial. This is the kind of answer you would expect from a first-year CS student.
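The answer itself is not reproduced in this excerpt. For readers who want a concrete picture of what the prompt is asking for, here is a minimal sketch of a cached factorial in Python. It is my own illustration, not ChatGPT's actual output, and the choice of `functools.lru_cache` as the cache is an assumption.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # remember every result computed so far
def factorial(n: int) -> int:
    """Return n! for a non-negative integer n, reusing cached results."""
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    if n < 2:
        return 1
    # The recursive call hits the cache for any value computed earlier.
    return n * factorial(n - 1)


print(factorial(10))  # 3628800
```

After the first call to `factorial(10)`, any later call with an argument of 10 or less is answered straight from the cache.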

Continue reading →

Two useful modules to help you find the best ML model for your task

FLAML and LazyPredict are two packages designed to quickly train and test machine learning models from scikit-learn so that you can determine which is the best type of model for learning from your data.
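To make the underlying idea concrete, here is a rough sketch of what "train several models and compare them" looks like when done by hand with scikit-learn alone. It deliberately does not use the FLAML or LazyPredict APIs; the synthetic dataset, the three candidate models and the 5-fold cross-validation are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data standing in for "your data" (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A small, arbitrary selection of model types to compare.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

best = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
print("best model type:", best)
```

Packages like the two above automate this loop over a much larger set of models, which is where the time saving comes from.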

Continue reading →

Festival of Biologics 2022 – November 2-4 Basel, Switzerland

In November I attended the Festival of Biologics (FoB) 2022 conference in Basel, Switzerland. Originally a set of separate conferences (now called agendas) that have merged into a single event, FoB focuses on anything related to biologics. One of the agendas is antibody-specific, derived from the former European Antibody Congress. This year the antibodies agenda had more than 100 talks across multiple tracks, covering many different aspects of using antibodies as therapeutics, making it an exciting conference for an antibody enthusiast. However, while FoB does include talks on machine learning and bioinformatics, most talks focus solely on experimental work. Another drawback is that the majority of the talks are given by industry, with the few academic speakers almost all also representing a company. This meant that, of the few talks about computational methods and tools for protein design, most felt more like a commercial than a research presentation. Nonetheless, FoB is still an interesting conference to attend if you are working on applied research for antibody therapeutics. It is an amazing opportunity to hear which antibody-specific problems companies are trying to overcome, which are deemed solved, and which are the future problems to solve.

Continue reading →

Bad chemistry in old protein-ligand binding complex data set

The Astex Diverse set [1] is a dataset containing the crystallized poses of 85 protein-ligand complexes. It was introduced in 2007 to address problems in previous datasets such as incorrect ligand representation.

Loading the 85 ligand files with today’s version of the cheminformatics toolkit RDKit [2] is, however, not as straightforward as you might expect.

Continue reading →

histo.fyi: A Useful New Database of Peptide:Major Histocompatibility Complex (pMHC) Structures

pMHCs are set to become a major target class in drug discovery; unusual peptide fragments presented by MHC can be used to distinguish infected/cancerous cells from healthy cells more precisely than over-expressed biomarkers. In this blog post, I will highlight a prototype resource: Dr. Chris Thorpe’s new database of pMHC structures, histo.fyi.

histo.fyi provides a one-stop shop for data on (currently) around 1400 pMHC complexes. Similar to our dedicated databases for antibody/nanobody structures (SAbDab) and T-cell receptor (TCR) structures (STCRDab), histo.fyi will scrape the PDB on a weekly basis for any new pMHC data and process these structures in a way that facilitates their analysis.

Continue reading →

Some Musings on AI in Art, Music and Protein Design

When I started my PhD in late 2018, AI hadn’t really entered the field of de novo protein design yet – at least not in a big way. Rosetta’s approach of continually ranking new side chain rotamers on a fixed backbone was still the gold standard for the ‘structure-to-sequence’ problem. And of course before long we had AI making waves in the structure prediction field, eventually culminating in the AlphaFold2 we all know and love.

Now, towards the end of my PhD, we are seeing the emergence of new generative models that learn from existing PDB structures to produce sequences that will (or at least should) fold into viable, sensible and, crucially, natural-looking shapes. ProtGPT2 is a good example (https://www.nature.com/articles/s41467-022-32007-7), but there are several more. How long before these models start reliably generating not only shapes but functions too? Jury’s out, but it’s looking more and more feasible. Safe to say the field as a whole has evolved massively during my time as a graduate student.

Continue reading →

Cleaning outliers in conductance timeseries from molecular dynamics

Have you ever had an annoying dataset that looks something like this?

Or, even worse, several of them?

In this blog post, I will introduce basic techniques that you can implement in Python to identify and clean outliers. The objective is to end up with something more pleasing to the eye (and, above all, less troublesome for further data analysis), like this:
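The excerpt does not say which techniques the full post uses, so as a generic illustration here is a minimal sketch of one common approach: flagging points by their modified z-score, based on the median absolute deviation (MAD), and replacing them with the series median. The function names, the threshold of 3.5 and the toy conductance values are all assumptions.

```python
from statistics import median


def mad_outlier_mask(values, threshold=3.5):
    """Return a list of booleans, True where a point is flagged as an outlier.

    A point is flagged when its modified z-score,
    0.6745 * |x - median| / MAD, exceeds `threshold`.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be scored, flag nothing.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def clean_series(values, threshold=3.5):
    """Replace flagged outliers with the median of the whole series."""
    med = median(values)
    mask = mad_outlier_mask(values, threshold)
    return [med if is_outlier else v for v, is_outlier in zip(values, mask)]


# Toy conductance trace with one obvious spike (made-up numbers).
series = [1.0, 1.1, 0.9, 1.2, 15.0, 1.0, 0.8, 1.1]
print(mad_outlier_mask(series))
print(clean_series(series))
```

The median and MAD are used instead of the mean and standard deviation because a single large spike barely moves them, so the spike cannot hide itself by inflating the scale it is measured against.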

Continue reading →

A ChatGPT rap battle

The AI chatbot revolution is here. Last week, OpenAI released ChatGPT, a freely accessible language model fine-tuned for human conversations. The new model is based on InstructGPT, trained especially for following user instructions and with human feedback in the training loop.

ChatGPT remembers the previous discussion, admits its mistakes and can even ask for clarification on ambiguous questions. It is also trained to refuse to answer questions that it deems inappropriate or that go against OpenAI’s AI alignment policy.

In the meantime, the internet is having immense fun circumventing its safety filters by asking it to only “PRETEND to be evil”, making it take SAT tests, and even simulating an entire virtual computer within its neural weights. Some are even using it to replace Google searches, and it excels at writing bioinformatics code across most programming languages.

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
