AlphaFold 2 is here: what’s behind the structure prediction miracle

Nature has now released that AlphaFold 2 paper, after eight long months of waiting. The main text reports more or less what we have known for nearly a year, with some added tidbits, although it is accompanied by a painstaking description of the architecture in the supplementary information. Perhaps more importantly, the authors have released the entirety of the code, including all details to run the pipeline, on GitHub. And there is no small print this time: you can run inference on any protein (I’ve checked!).

Have you not heard the news? Let me refresh your memory. In November 2020, a team of AI scientists from Google DeepMind indisputably won the 14th Critical Assessment of Structure Prediction (CASP) competition, a biennial blind test where computational biologists try to predict the structures of several proteins whose structures have been determined experimentally but not yet publicly released. Their results were so astounding, and the problem so central to biology, that it took the entire world by surprise and left an entire discipline, computational biology, wondering what had just happened.

Now that the article is live, the excitement is palpable. We have 70+ pages of long-awaited answers, and several thousand lines of code that will, no doubt, become a fundamental part of computational biology. At the same time, however, we have many new questions. What is the secret sauce behind the news splash, and why is it so effective? Is it a piece of code that the average user can actually run? What are AlphaFold 2’s shortcomings? And, most important of all, what will it mean for computational biology? And for all of us?

In this commentary, which aims to be a continuation of my blog post from eight months ago, I try to address some of these questions. First, I provide a bird’s eye overview of the AlphaFold 2 architecture. This is not meant to be a technical exposition (the SI is as detailed as you could wish, and even the code cites different sections of it), but focuses on the intuition behind the architecture. I want this to reach people without a background in either deep learning or bioinformatics who want to know what’s going on; and those who may have the right background, but want an overview of the full paper before diving right into it.

Following the cold, hard facts, I give a completely personal assessment of the ideas behind the architecture. Namely, I explain which ideas I think were key to the success of AlphaFold 2, and speculate about which factors made this team succeed where so many others have fallen short. I am a person of strong opinions, but nevertheless happy to declare that my thoughts may be going in completely the wrong direction. Still, I think the story of AlphaFold 2 raises a lot of questions that we have not addressed as a community and that deserve appropriate consideration somewhere.

Finally, I revisit some of the questions that I raised eight months ago. Some of these questions have been answered by the paper, or by the code (e.g. what it takes to run the code). Some others are not solved explicitly, but I have had a chance to reflect upon them more deeply and I think I have some novel insight. And some others are matters that have arisen from the new information, and that I think we will have to answer together.

I have promised myself that I will be more succinct this time — after all, in a few months I should be writing up my PhD thesis and I really don’t have much time to spare. Let’s see if I manage.

First act: how does AlphaFold 2 work?

Prelude

Until Thursday morning, the best answer we had was an image, published in DeepMind’s press release back in November. This schema made the rounds of the internet at the time, and has been featured in a multitude of conferences and discussion groups ever since. But, sadly, it was lacking in details, and even the most knowledgeable deep learning experts were only able to make educated guesses.

Diagram of AlphaFold 2 as published in DeepMind’s blogpost in November 2020.

The Nature article provides a very similar, but slightly more detailed diagram that outlines the different pieces of the architecture.

Diagram of AlphaFold 2 as published in the official Nature paper in July 2021. I have added the red separating lines for convenience.

The overarching idea is quite simple, so I will try to sketch it in a few lines. If you are not familiar with deep learning, the following might sound slightly abstract, and that is perfectly fine. I will take you through the details later. For now, though, let us try to get a schematic picture of the network. For clarity, I have divided the image into thirds which represent the three main parts of the AlphaFold 2 system.

First of all, the AlphaFold 2 system uses the input amino acid sequence to query several databases of protein sequences, and constructs a multiple sequence alignment (MSA). Put simply, an MSA identifies similar, but not identical, sequences that have been identified in living organisms. This enables the determination of the parts of the sequence that are more likely to mutate, and allows us to detect correlations between them. AlphaFold 2 also tries to identify proteins that may have a similar structure to the input (“templates”), and constructs an initial representation of the structure, which it calls the “pair representation”. This is, in essence, a model of which amino acids are likely to be in contact with each other.

In the second part of the diagram, AlphaFold 2 takes the multiple sequence alignment and the templates, and passes them through a transformer. We will talk about what a transformer entails later, but for now you can understand it as an “oracle” that can quickly identify which pieces of information are more informative. The objective of this part is to refine the representations for both the MSA and the pair interactions, but also to iteratively exchange information between them. A better model of the MSA will improve the network’s characterization of the geometry, which in turn will help refine the model of the MSA. This process is organised in blocks that are repeated iteratively for a specified number of cycles (48 blocks in the published model).

This information is taken to the last part of the diagram: the structure module. This sophisticated piece of the pipeline takes the refined “MSA representation” and “pair representation”, and leverages them to construct a three-dimensional model of the structure. Unlike the previous state-of-the-art models, this network does not use any optimisation algorithm: it generates a static, final structure, in a single step. The end result is a long list of Cartesian coordinates representing the position of each atom of the protein, including side chains.

So, to recap: AlphaFold 2 finds similar sequences to the input, extracts the information using a special neural network architecture, and then passes that information to another neural network that produces a structure.

One last piece is that the model works iteratively. After generating a final structure, it will take all the information (i.e. MSA representation, pair representation and predicted structure) and pass it back to the beginning of the Evoformer blocks, the second part of our diagram. This allows the model to refine its predictions, and also to produce some funny videos that you can find on the article’s page.
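If you think in code, the whole pipeline can be sketched in a dozen lines. The snippet below is a toy of my own making: the function names, shapes and number of cycles are placeholders rather than DeepMind’s actual API, and the sub-networks are replaced by random stand-ins, so only the control flow of the recycling loop mirrors the real thing.

```python
import numpy as np

# Toy sketch of the AlphaFold 2 inference loop. The sub-networks are random stand-ins;
# only the overall control flow (Evoformer -> structure module -> recycle) is meaningful.
L, N_SEQ, C = 100, 64, 32                   # residues, aligned sequences, channel width

def evoformer(msa_repr, pair_repr):
    # Stand-in for the 48 Evoformer blocks, which jointly refine both representations
    return (msa_repr + 0.01 * np.random.randn(*msa_repr.shape),
            pair_repr + 0.01 * np.random.randn(*pair_repr.shape))

def structure_module(msa_repr, pair_repr):
    # Stand-in for the structure module, which would emit 3D coordinates per residue
    return np.random.randn(L, 3)

msa_repr = np.random.randn(N_SEQ, L, C)     # embedded multiple sequence alignment
pair_repr = np.random.randn(L, L, C)        # embedded pair representation

for _ in range(3 + 1):                      # first pass plus a few recycling iterations
    msa_repr, pair_repr = evoformer(msa_repr, pair_repr)
    coords = structure_module(msa_repr, pair_repr)
    # in the real model, the representations and coordinates feed back into the next cycle

print(coords.shape)                         # (100, 3)
```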

Sounds easy, right? Let us go through the details then.

Preprocessing

Like most bioinformatics programs, AlphaFold 2 comes equipped with a “preprocessing pipeline”, which is the discipline’s lingo for “a Bash script that calls some other codes”. The pipeline runs a number of programs for querying databases and, using the input sequence, generates a multiple sequence alignment (MSA) and a list of templates. Every program has a slightly different script, but AlphaFold 2’s is not too different from your garden variety protein structure prediction preprocessing pipeline.

It is worth explaining the meaning of a multiple sequence alignment. In an MSA, the sequence of the protein whose structure we intend to predict is compared across a large database (normally something like UniRef, although in recent years it has become common to enrich these alignments with sequences derived from metagenomics). The underlying idea is that, if two amino acids are in close contact, mutations in one of them will be closely followed by mutations of the other, in order to preserve the structure.

Schematic of how co-evolution methods extract information about protein structure from a multiple sequence alignment (MSA). Image modified from doi: 10.5281/zenodo.1405369, which in turn was modified from doi: 10.1371/journal.pone.0028766

Consider the following example. Suppose we have a protein where an amino acid with negative charge (say, glutamate) is near to an amino acid with positive charge (say, lysine), although they are both far away in the amino acid sequence. This Coulombic interaction stabilises the structure of the protein. Imagine now that the first amino acid mutates into a positively charged amino acid — in order to preserve this contact, the second amino acid will be under evolutionary pressure to mutate into a negatively charged amino acid, otherwise the resulting protein may not be able to fold. Of course, real situations are rarely as clear-cut as this example, but you get the idea.
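To make the co-evolution idea concrete, here is a toy calculation of my own (not part of AlphaFold 2, which learns directly from the raw alignment): given a small made-up alignment, we can score how strongly two columns co-vary using their mutual information. Classical co-evolution methods build on this kind of statistic, with heavy corrections for phylogeny and indirect couplings that the sketch below ignores.

```python
import numpy as np
from collections import Counter

# Toy MSA: five aligned sequences of length six (one-letter amino acid codes).
# Column 1 (K/R) and column 2 (E/D) co-vary perfectly in this made-up example.
msa = ["MKELAV",
       "MRDLAV",
       "MKELGV",
       "MRDLAV",
       "MKEIAV"]

def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns, with no corrections."""
    n = len(col_i)
    p_i, p_j, p_ij = Counter(col_i), Counter(col_j), Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        p_ab = count / n
        mi += p_ab * np.log2(p_ab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

columns = list(zip(*msa))                     # columns[k] is the k-th alignment column
print(round(mutual_information(columns[1], columns[2]), 3))   # ~0.971: strong co-variation
print(round(mutual_information(columns[0], columns[2]), 3))   # 0.0: column 0 never varies
```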

Finding templates follows a completely different, but closely related principle. The philosophy behind template modelling may be encoded in the maxim “there is nothing new under the sun”. Proteins mutate and evolve, but their structures tend to remain similar despite the changes. In the image below, for example, I display the structure of four different myoglobin proteins, corresponding to different organisms. You can appreciate that they all look pretty much the same, but if you were to look at the sequences, you would find enormous differences. The protein on the bottom right, for example, only has ~25% of its amino acids in common with the protein on the top left.

Protein structures of human myoglobin (top left), african elephant myoglobin (top right, 80% sequence identity), blackfin tuna myoglobin (bottom right, 45% sequence identity) and pigeon myoglobin (bottom left, 25% sequence identity)

In most cases, however, conservation occurs on a smaller scale, where pieces of the protein (say, the active centre of an enzyme) remain mostly unchanged while their surroundings evolve. Size does not really matter: using the right methods, it is possible to identify some of these conserved fragments and use them as a guide to construct the structure. This has been such an important ingredient in structure prediction that targets in CASP have classically been classified depending on the number of templates available.

Is there anything special in here? Not really: most of the participants at CASP14 followed very similar strategies. The idea of using correlated mutations to extract structural information from an MSA is decades old, and collecting pieces of other proteins to model your target’s structure is perhaps even older. I would say that, so far, there is nothing new.

The Evoformer (evolutionary transformer?) module

Here is where the story really gets going. The first section of the gigantic AlphaFold 2 neural network, the Evoformer, has the task of squeezing every ounce of information out of the multiple sequence alignment and the templates.

You may not be surprised to hear that extracting information from the multiple sequence alignment (“coevolutionary analysis”) has been a prime pursuit of structural bioinformatics for years. People started looking at it in the nineties, although with limited success. At the beginning of the last decade, several groups started to identify a number of biases that had stymied prior attempts, and developed powerful statistical machinery to correct them. There was some consistent progress for several years. And then, in CASP13 (2018), several groups demonstrated that there was actually no need for robust statistics: you just needed to train deep residual neural networks.

AlphaFold 2’s Evoformer completely reinvents this process and takes it several steps further.

The central idea behind the Evoformer is that the information flows back and forth throughout the network. Before AlphaFold 2, most deep learning models would take a multiple sequence alignment and output some inference about geometric proximity. Geometric information was therefore a product of the network. In the Evoformer, instead, the pair representation is both a product and an intermediate layer. At every cycle, the model leverages the current structural hypothesis to improve the assessment of the multiple sequence alignment, which in turn leads to a new structural hypothesis, and so on, and so on. Both representations, sequence and structure, exchange information until the network reaches a solid inference.

This is easier to understand as an example. Suppose that you look at the multiple sequence alignment and notice a correlation between a pair of amino acids. Let’s call them A and B. You hypothesise that A and B are close, and translate this assumption into your model of the structure. Subsequently, you examine said model and observe that, since A and B are close, there is a good chance that C and D should be close. This leads to another hypothesis, based on the structure, which can be confirmed by searching for correlations between C and D in the MSA. By repeating this several times, you can build a pretty good understanding of the structure.

Conceptualization of the Evoformer information flow. In the left diagram, the MSA transformer identifies a correlation between two columns of the MSA, each corresponding to a residue. This information is passed to the pair representation, where subsequently the pair representation identifies another possible interaction. In the right diagram, the information is passed back to the MSA. The MSA transformer receives an input from the pair representation, and observes that another pair of columns exhibits a significant correlation.

Okay, this is a simple intuition. Now, how does it work in practice?

The first step in the network is to define “embeddings” for the MSA and templates. Bear in mind that multiple sequence alignments are ultimately sequences of symbols from a finite alphabet: a prime example of a discrete variable. Neural networks, on the other hand, are intrinsically continuous devices that rely on differentiation to learn from their training set. An “embedding” is a trick from the deep learning magic book that allows the transformation of a discrete variable to a continuous space (“embedded space”) so that the network can be trained.

Complicated as this may sound, it is in fact very simple. You just need to define a layer of neurons that receives the discrete input and outputs some continuous vector. The embedding may be pretrained, as used to be common in natural language processing (NLP), but more commonly it is trained alongside whatever objective we are trying to learn. In AlphaFold 2, the embeddings are vanilla dense neural networks.

Visualization of the Word2Vec embedding. We selected the word “protein” and marked a number of words that show high similarity. This image was created using the TensorFlow Embedding Projector.
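As an illustration of how simple this can be, here is a toy embedding of my own in NumPy (not AlphaFold 2’s actual layers): a one-hot encoding of the amino acid alphabet followed by a learned linear projection, which is exactly what a dense layer without a bias does.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                   # the 20 standard residues
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
EMBED_DIM = 8                                           # toy embedding width

rng = np.random.default_rng(0)
W = rng.normal(size=(len(AMINO_ACIDS), EMBED_DIM))      # learnable weights (random here)

def embed(sequence):
    """Map a discrete sequence of residues to a matrix of continuous vectors."""
    one_hot = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        one_hot[pos, VOCAB[aa]] = 1.0
    return one_hot @ W                                  # equivalent to a dense layer, no bias

print(embed("MKTAYIAK").shape)                          # (8, 8): 8 residues, 8 channels each
```

During training, gradients flow into W just like into any other layer, so the network learns whatever continuous representation of the alphabet is most useful for the task at hand.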

Once our MSA and templates are in the correct “embedding space”, it is time for the Evoformer to work its magic. To understand the Evoformer, you first need to be familiar with the hottest deep learning architecture to date: the transformer. There is no shortage of material explaining this architecture, and quite frankly, many if not most will be better than mine. If you are interested in an in-depth analysis, I would recommend The Illustrated Transformer, by Jay Alammar. If you just want to know the minimum, read below. And, if you already know transformers like the back of your hand, hit Ctrl+F and find the sentence “Back to the Evoformer” a few paragraphs below.

The transformer architecture was introduced in 2017 by a team at Google Brain, in a paper entitled “Attention is all you need“. As you will probably imagine from the conspicuous title, the key ingredient is a novel mechanism called attention. The objective of attention is to identify which parts of the input are more important for the objective of the neural network. In other words, to identify which parts of the input it should pay attention to.

Imagine that you are trying to train a neural network to produce image captions. One possible approach is to train the network to process the whole image — say ~250k pixels in a 512×512 picture. This may work, but there are some reasons why it is not the best idea. First of all, because this is not what we humans do: when we look at a picture, we do not see it “as a whole”. Instead, we segment it into different patterns: a child, a dog, a frisbee. Luckily, it turns out that we can train a cleverly-designed neural network layer to perform this task. And, empirically, it seems to improve the performance by a lot.

This image has been taken from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, by Kelvin Xu et al. (2016)

Image captioning is not the only example, by far. More related to the original “Attention is all you need” paper is the case of machine translation, where attention reveals which parts of the sequence are important for the current part of the translation.

An attention matrix in the context of a machine translation problem. The model learns which words are relevant for what part of the translation. For example, the French words “marin” and “environnement” have a high attention coefficient with the English words “marine” and “environment”. This image has been taken from Attention and its different forms, by Anusha Lihala.

There are a lot of other reasons why transformers are in common use. In machine translation, for example, they help ameliorate the “vanishing gradient problem”, a common hurdle during training. In sequence-based models, they can significantly speed up training with respect to classical recurrent neural network (RNN) models. Most important of all, transformers have empirically demonstrated superior performance in a variety of tasks. In particular, they are behind most of the splashiest AI achievements of the past year — for example, the GPT in GPT-3 stands for “Generative Pre-trained Transformer”.

I do feel I have to mention one last disclaimer before I consider my explanation complete. There is a reason why transformers have not been widely implemented in many fields. The construction of the attention matrix leads to a quadratic memory cost. Even if you have a latest-generation NVIDIA A100 with 80 GB of dedicated memory, that can get filled very quickly. Unsurprisingly, one of the crucial advantages of Google’s celebrated Tensor Processing Units (TPUs) is the massive amount of memory per core. This situation may change very quickly with the introduction of novel architectures such as the Performer or the Perceiver, which reduce this quadratic cost to a pseudo-linear one. But, anyway, I am getting ahead of myself.
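Before going back to the Evoformer, here is a bare-bones NumPy version of the attention operation itself (generic scaled dot-product attention of my own writing, not DeepMind’s implementation). Note how the matrix of logits has one entry per pair of tokens: that n × n object is precisely where the quadratic memory cost mentioned above comes from.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention: every query takes a weighted average of the values."""
    d_k = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d_k)    # (n, n) matrix: the quadratic-memory culprit
    weights = softmax(logits, axis=-1)          # how much attention each query pays to each key
    return weights @ values

rng = np.random.default_rng(0)
n, d = 6, 4                                     # a toy "sequence" of 6 tokens, 4 channels each
x = rng.normal(size=(n, d))
# In a real transformer, queries, keys and values are separate learned projections of x
print(attention(x, x, x).shape)                 # (6, 4)
```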

Back to the Evoformer. The Evoformer architecture uses not one, but two transformers (a “two-tower architecture”), with a clear communication channel between the two. Each tower is specialised for the particular type of data it is looking at: either a multiple sequence alignment, or a representation of pairwise interactions between amino acids. Each also incorporates information from the other representation, allowing for regular exchange of information and iterative refinement.

Let us first look at the tower that attends to the MSA, which I will term the “MSA transformer”, in honour of the February paper by Facebook AI Research which implements a similar idea. The MSA transformer computes attention over a very large matrix of protein symbols. To reduce what would otherwise be an impossible computational cost, the attention is “factorised” into “row-wise” and “column-wise” components. Namely, the network first computes attention in the horizontal direction, allowing it to identify which pairs of amino acids are more related; and then in the vertical direction, determining which sequences are more informative.

The most important feature of AlphaFold 2’s MSA transformer is that the row-wise (horizontal) attention mechanism incorporates information from the “pair representation”. When computing attention, the network adds a bias term that is calculated directly from the current pair representation. This trick augments the attention mechanism and allows it to pinpoint interacting pairs of residues.
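Here is how I picture that trick in toy NumPy form. This is a drastic simplification of my own (the real layer is multi-headed, gated and projects everything through learned weights), but it shows the two essential ingredients: attention computed independently within each row of the MSA, and a bias term injected from the pair representation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def row_attention_with_pair_bias(msa_repr, pair_bias):
    """Row-wise attention over an MSA representation, biased by the pair representation.

    msa_repr:  (n_sequences, n_residues, channels) embedded alignment
    pair_bias: (n_residues, n_residues) scalar bias derived from the pair representation
    """
    n_seq, n_res, channels = msa_repr.shape
    out = np.empty_like(msa_repr)
    for s in range(n_seq):                         # attend within each aligned sequence (row)
        row = msa_repr[s]                          # (n_res, channels)
        logits = row @ row.T / np.sqrt(channels)   # residue-residue attention logits
        logits = logits + pair_bias                # inject the current structural hypothesis
        out[s] = softmax(logits, axis=-1) @ row
    return out

rng = np.random.default_rng(0)
msa_repr = rng.normal(size=(16, 10, 8))            # 16 sequences, 10 residues, 8 channels
pair_bias = rng.normal(size=(10, 10))              # would come from the pair representation
print(row_attention_with_pair_bias(msa_repr, pair_bias).shape)   # (16, 10, 8)
```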

The other transformer tower, the one that acts on the pair representation, works in a similar manner, although a lot of details differ, of course. The key feature of this network is that attention is arranged in terms of triangles of residues. The intuition here is to enforce the triangle inequality, one of the axioms of metric spaces. This is quite a clever idea, since one of the classical problems of deep learning-based structure prediction was that distance distributions could not be embedded in three-dimensional space. It seems this trick fixes that, and then some.

Triangular attention, as published in the Nature paper.

After a number of iterations, 48 in the paper, the network has built a model of the interactions within the protein. Now it is time to build a structure.

The structure module

If you got this far, you probably know by now that AlphaFold 2 processes the sequence search data to generate two “representations”: a representation of the multiple sequence alignment (MSA), which captures sequence variation; and a representation of the “pairs of residues”, which captures which residues are likely to interact with each other. The question is now: how do we get a structure from these? This is a job for the structure module.

The idea behind the structure module is conceptually very simple, but it gets muddy through the many, many details of the implementation.

The structure module considers the protein as a “residue gas”. Every amino acid is modelled as a triangle, representing the three atoms of the backbone. These triangles float around in space, and are moved by the network to form the structure.

The “residue gas” approach. Image taken from the OpenFold 2 webpage, by Georgy Derevyanko.

These transformations are parametrised as “affine matrices”, which are a mathematical way to represent translations and rotations in a single 4×4 matrix:

Mathematical representation of an affine transformation matrix. The 3×3 block corresponds to a rotation matrix, and the 3×1 column is the displacement vector. Multiplying an affine matrix by a vector is equivalent to rotating and subsequently displacing said vector. Image taken from BrainVoyager.

At the beginning of the structure module, all of the residues are placed at the origin of coordinates. At every step of the iterative process, AlphaFold 2 produces a set of affine matrices that displace and rotate the residues in space. This representation does not reflect any physical or geometrical assumptions, and as a result the network has a tendency to generate structural violations. This is particularly visible in the supplementary videos, which display some deeply unphysical snapshots.
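To make the affine parametrisation concrete, this is roughly what applying one of those 4×4 matrices to a residue’s backbone atoms looks like. The numbers below (a 30-degree rotation about the z-axis, an arbitrary translation, made-up atom positions) are purely illustrative.

```python
import numpy as np

def affine_matrix(rotation, translation):
    """Pack a 3x3 rotation and a 3-vector translation into a single 4x4 affine matrix."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def apply_affine(T, points):
    """Apply the affine transform to an (n, 3) array of atom positions."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])   # append a 1 to each point
    return (homogeneous @ T.T)[:, :3]                              # rotate, then translate

theta = np.pi / 6                                    # arbitrary 30-degree rotation about z
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
T = affine_matrix(Rz, translation=[1.0, -2.0, 0.5])

backbone = np.array([[0.0, 0.0, 0.0],                # toy N, CA, C positions of one residue
                     [1.5, 0.0, 0.0],
                     [2.0, 1.4, 0.0]])
print(apply_affine(T, backbone))                     # the residue "triangle", moved in space
```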

The secret sauce of the structure module is a new flavour of attention devised specifically for working with three-dimensional structures — DeepMind calls it “Invariant Point Attention” (IPA). This attention mechanism benefits from a property that they had already announced in November: invariance to translations and rotations. Why is this invariance such a big deal? You may understand this as a form of data augmentation: if the model knows that any possible rotation or translation of the data will lead to the same answer, it will need a lot less data to pull it away from wrong models and will therefore be able to learn much more.

It turns out that AlphaFold 2’s attention mechanism is much simpler than the SE(3)-equivariant transformer that underlies RoseTTAFold and that most groups have been experimenting with. In fact, it is based on a very simple mathematical fact: that distances between points (the L2-norms of their difference vectors) are invariant with respect to translations and rotations. Unfortunately, a full explanation of Invariant Point Attention (IPA) is well beyond the scope of this blog post, as it would require talking about keys, queries and values, and some other machine learning witchery. If you read section 1.8 of the Supplementary Information you will find all the information you need to understand it.
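That fact is easy to check numerically: if you move a set of points with the same rigid motion, all the distances between them stay exactly the same. A quick sanity check of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))                 # five arbitrary 3D points

# Build a random proper rotation (via QR decomposition) and a random translation
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1                                # ensure a rotation, not a reflection
t = rng.normal(size=3)

moved = points @ Q.T + t                         # apply the same rigid motion to every point

# Pairwise distances are identical before and after the rigid motion
d_before = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
d_after = np.linalg.norm(moved[:, None] - moved[None, :], axis=-1)
print(np.allclose(d_before, d_after))            # True
```

Attention built on quantities like these gives the same answer no matter how the protein happens to be oriented in space, which is exactly the invariance we were after.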

It is also worth noting that the Structure Module generates a model of the side chains. To simplify the system, their positions are parametrised by a list of torsion angles, which are predicted by the network and converted into atom positions with standard geometric subroutines.

Dihedral angles in glutamate.

And this is most of it. After a couple of iterations, you will get a structure prediction.

And there is more…

What precedes is a hand-wavy explanation of the scaffold of AlphaFold 2, aiming to sidestep the intricate engineering of the network in favour of big picture intuition. But there are many other interesting details that should be taken into consideration.

One of said details is the loss function used by AlphaFold 2. The DeepMind team introduced a specific structural loss which they called FAPE (Frame Aligned Point Error), which we could understand as a clever version of the more commonly used root-mean-square deviation (RMSD) of atomic positions. It also comes with an added property: it is not invariant to reflections, and thus it prevents the network from producing proteins of the wrong chirality.
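My loose reading of the idea, in toy form: express the predicted and the true atom positions in the local frame of every residue, and average the (clamped) deviations. The sketch below is a simplification of my own, with hypothetical names and a clamping threshold I picked for illustration; the exact loss is spelled out in the Supplementary Information.

```python
import numpy as np

def fape_like_loss(pred_frames, true_frames, pred_atoms, true_atoms, clamp=10.0):
    """Simplified FAPE-style loss (an illustration, not the exact loss in the paper).

    pred_frames, true_frames: lists of (rotation, translation) tuples, one frame per residue
    pred_atoms, true_atoms:   (n_atoms, 3) arrays of atom positions
    """
    errors = []
    for (R_p, t_p), (R_t, t_t) in zip(pred_frames, true_frames):
        # Express every atom in the local frame of this residue (the inverse rigid transform)
        local_pred = (pred_atoms - t_p) @ R_p
        local_true = (true_atoms - t_t) @ R_t
        deviation = np.linalg.norm(local_pred - local_true, axis=-1)
        errors.append(np.minimum(deviation, clamp))   # clamping limits the effect of outliers
    return float(np.mean(errors))

# Toy usage: a single residue with identity orientation, one atom slightly misplaced
I = np.eye(3)
print(fape_like_loss([(I, np.zeros(3))], [(I, np.zeros(3))],
                     np.array([[1.0, 0.0, 0.0]]), np.array([[1.2, 0.0, 0.0]])))   # ~0.2
```

Because the comparison happens inside local frames, mirroring the prediction changes the answer, which is how this kind of loss ends up penalising the wrong chirality.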

It seems this function was not enough. The final AlphaFold 2 loss function is in fact a weighted sum of multiple “auxiliary losses”, which do not necessarily relate to the performance of the model, but that provide additional information. For example, the DeepMind team computed the loss not only of the final output, but also of each of the three iterations of the structure module. It also includes a “distogram loss”, where the predicted structure is used to generate a 2D matrix of distances that is also compared to the ground truth.

An auxiliary loss that I found particularly interesting is the “MSA masking”. At every step of training, the model is given a multiple sequence alignment with some symbols “masked out”, and asked to predict these symbols. This is just a modality of self-supervised learning, as popularised by BERT and other models; but surprisingly, it is used in both inference and training.
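In spirit, the trick looks something like the sketch below (my own toy, with a made-up alignment and a mask token of my choosing): hide a random fraction of the alignment and ask the network to reconstruct what was there.

```python
import numpy as np

rng = np.random.default_rng(0)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

msa = ["MKELAV",                                  # toy alignment, three aligned sequences
       "MRDLAV",
       "MKELGV"]
tokens = np.array([[AMINO_ACIDS.index(aa) for aa in seq] for seq in msa])

# Mask out a random ~15% of the positions (BERT-style); -1 stands for the [MASK] symbol
mask = rng.random(tokens.shape) < 0.15
masked_tokens = np.where(mask, -1, tokens)

# A real model would output a distribution over the 20 amino acids at each masked position,
# and the auxiliary loss would be the cross-entropy against the original (hidden) symbols.
print(masked_tokens)
print("positions to reconstruct:", int(mask.sum()))
```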

That is not the only sleight-of-hand. One of the cleverer tricks the team pulled off is self-distillation. In this approach, they took a model trained exclusively on the PDB, and predicted the structures of ~300k diverse protein sequences obtained from Uniclust. They then retrained the full model, incorporating a small random sample of these structures at every training cycle. They claim this allows the model to leverage the large amount of unlabelled data available in protein sequence repositories. (Correction: on a previous version of this article I mistakenly mentioned they implemented this trick after CASP14, when in fact they used the “self-distilled” model in the assessment).

This, and many other tricks, are described in exhaustive detail in the Supplementary Information. A reduced subset has been analysed in a brief ablation study but, ultimately, how important each of these minor details is remains anybody’s guess.

Second act: it’s all about the goss

Feelings and other stuff

I will be honest with you: all in all, after I read the paper for the first time, my feelings were very clear: I felt quite disappointed.

Do not get me wrong. I think this is one of the most important scientific achievements of this century, and I do not have enough superlatives to qualify how impressive it is. Now that the paper is published and the code has been open sourced, I would expect DeepMind’s team to be showered with accolades in the following years. I truly and wholeheartedly believe they deserve them.

No, what really disappoints me is that there are no surprises. I have spent a good part of the past year talking to people about what AlphaFold 2 might be doing. I have talked to the OpenFold 2 people, and followed some of the people on EleutherAI, who have been trying to reproduce the code. When I chatted to Georgy Derevyanko (who led OpenFold 2 and knows way more about deep learning than I do) right after CASP14, he pretty much uncovered the main elements of the architecture they are reporting today.

Throughout the past year I have been wondering if DeepMind might have arrived at some powerful insight about protein structure. Was that not the reason behind their secrecy? Could they have uncovered some astoundingly clever principle about protein folding that they could directly encode into their models? Had they managed to see further than anyone else?

Halfway through the paper, and quite frankly, after a few glances at the Supplementary Information, I realised what the secret sauce was. The incredible performance of this network seems down to DeepMind’s superb engineering. The ideas in the model, clever as they are, correspond strictly to what they disclosed in their CASP14 presentation in November. It is their access to compute resources, and their top-notch know-how, that turned these ideas into the successful neural network it became.

Why is engineering so important?

If you do not think engineering played a key role in the development of AlphaFold 2, I invite you to have a look at the Supplementary Information. It is replete with images like the following, representing innumerable combinations of matrix products.

One does not just come up with an architecture like this, regardless of the number or quality of the scientists involved. Even if they started to design AlphaFold 2 with the current ideas (something that I very strongly doubt), each and every piece of the puzzle must have gone through many cycles of trial and error.

Why does the MSA transformer use gated attention, while the other transformers do not? Why does the MSA representation learn from the pair representation as an input, while the latter incorporates the former as an outer product sum? Why do they use so many losses? These, and many other details are probably due to intensive experimentation. One can only imagine how many architectures they discarded, or how many times they had to tweak an architecture until it finally worked.

This leaves one to wonder exactly how much computational power went into this project. In their original announcement, DeepMind claimed that they used “128 TPUv3 cores or roughly equivalent to ~100-200 GPUs”. Although this amount of compute seems beyond the wildest dreams of most academic researchers, my friends at Google tell me it is not uncommon for individual Googlers to access similar resources on a regular basis. How many times more computational power did this team have, in comparison with all the other CASP14 contenders combined?

Again, do not mistake me. This achievement is by no means a product of computational power or mindless experimentation alone. There are clever and beautiful ideas throughout. Nevertheless, it would be unreasonable to forget that this achievement would not have been possible without an enormous investment supporting it. And that if an academic group had come up with similar ideas, they simply would not have had the resources and engineering know-how to implement them in practice.

Third act: so what? And, what now?

Will DeepMind open source their AlphaFold 2 code?

Yes! It is here.

Is AlphaFold 2 a code that anyone can run?

There won’t be an Android app anytime soon, but anyone should be able to run it, provided they are willing to invest in a powerful compute server (or the cloud).

If you, like me, have a standard laptop with a small GPU, then you will probably struggle. After painstakingly downloading and extracting the 2.5 TB of databases (mostly the 2 TB of the Big Fantastic Database, BFD), I would expect something like this:

Output of running AlphaFold 2 on a Dell Inspiron 15 7000 laptop with a NVIDIA GeForce GTX 1650. Note that I get similar errors when I run the code with the --nouse_gpu option, likely because my RAM is also quite limited.

On the other hand, if you have access to standard computing servers, then you should be able to run it without much trouble. I have tested it on a Quadro RTX 6000 GPU (~20 GB of dedicated memory) and on a CPU server with ~300 GB of RAM, and in both cases I was able to obtain structure predictions. In particular, I have run a bunch of proteins of up to 600 amino acids, and I seem to have been able to produce an answer in every case.

To DeepMind’s credit, they have put some effort into making their code accessible. The code is released with a Docker image and a matching launcher script, which make installing the code very easy. If you are using an HPC system, it is possible to make some minimal changes and run it under Singularity — I will probably write a post about that very soon.

One can only speculate what may have convinced DeepMind to publicly release the code, instead of exploiting it commercially. Whatever the reason, they decided to do the right thing and they deserve credit for that. (Update: a previous version of this article suggested that the release of RoseTTAFold might have put pressure on DeepMind to release the full code; however, a reputable source has confirmed that they already promised the open source code in the submitted manuscript on May 11th, much before RoseTTAFold was even published as a preprint)

What are the limitations of AlphaFold 2?

This is the question that is on everyone’s mind right now, and we may not be able to answer it completely for some time. Someone needs to run independent assessments as soon as possible.

“Wait”, you may be asking, “isn’t the paper’s whole point to show that AlphaFold 2 works?”. Yes, that is correct. In particular, the DeepMind team has shown that the impressive results of CASP14 seem to extend to a large sample of recent PDB structures.

However, there is more than meets the eye. AlphaFold depends on a multiple sequence alignment as input, and it remains to be seen if it can tackle problems where these are shallow or not very informative, as happens with designed proteins or antibody sequences. Scientists working on those, and other areas, will have to run independent benchmarks to verify if AlphaFold 2 is actually able to predict those proteins well. For now, the DeepMind team suggests that “accuracy drops substantially when the mean alignment depth is less than ~30 sequences”.

Out of curiosity, I decided to run a simple test case with a designed protein (Top7, PDB: 2MBM) and an antibody (PDB: 7MFB). In both cases, I asked the program to ignore any template corresponding to structures published after 2010. This is an interesting, but highly unrigorous assessment of how it could perform on both families.

The prediction of the designed protein is surprisingly good, despite the fact that no templates are available and the multiple sequence alignment has fewer than 10 sequences. The fold topology is definitely consistent with the NMR structure; the TM-score is 0.58, and the all-atom RMSD is 3.5 Å after ignoring the poly-histidine tag on the C-terminal loop.

Highest-score NMR structure of the Top7 protein (PDB: 2MBM, orange) superposed with the AlphaFold 2 prediction (blue).

Is it the mind-blowing prediction accuracy that DeepMind boasts in their presentations? Probably not. Nonetheless, it is a good quality prediction, and possibly as useful as the NMR structure itself.
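If you want to reproduce this kind of number yourself, the RMSD after optimal superposition can be computed with the Kabsch algorithm. Below is a bare-bones NumPy version of my own, operating on two already-matched sets of coordinates (in practice you would feed it, say, the CA atoms of the experimental structure and of the prediction); dedicated tools such as TM-score or PyMOL will of course do the same job with far more care.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)                       # centre both structures at the origin
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                                  # covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # sign correction to avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T      # optimal rotation of P onto Q
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=-1))))

# Toy usage: a random "structure" and a rigidly moved copy of it
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
M, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(M) < 0:
    M[:, 0] *= -1                                # make it a proper rotation, not a reflection
Q = P @ M.T + 5.0
print(round(kabsch_rmsd(P, Q), 6))               # ~0.0: identical up to a rigid motion
```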

What about antibodies? I went into the PDB and pulled the latest entry that contained the word “antibody” and the tag “Homo sapiens”. I gave AlphaFold 2 the sequence of the heavy chain. Unsurprisingly, this provides a very large alignment with tons of contributions from all of the databases.

It is interesting that, despite the deep alignment and the availability of templates, the prediction of the antibody is comparatively much worse. The all-atom RMSD of the entire chain is 5.2 Å and the TM-score is only 0.3, right above the cut-off for a non-random prediction. The structural part of the heavy chain (left) is particularly bad, especially given that this part of the antibody can be predicted with high accuracy with a homology model.

Heavy chain portion of the crystal structure of an antibody (PDB: 7MBF, orange) superposed with the AlphaFold 2 prediction (blue).

In comparison, if we focus on the complementarity-determining region (CDR), the result is surprisingly much better, with an all-atom RMSD of just 0.4 Å.

Complementarity-determining region (CDR) of the crystal structure of an antibody (PDB: 7MBF, orange) superposed with the AlphaFold 2 prediction (blue).

The majority of the CDR seems to be predicted almost to perfection. The only area that shows a significant difference is the H3 loop, which is known to be the most difficult part to predict.

What can we learn from this quick experiment? We cannot guarantee that AlphaFold 2 will be able to emulate its breathtaking accuracy across all classes of protein structure prediction problems. However, it does not seem to be doing anything crazy. Even if the standard model does not work, it seems reasonable that, with the right tweaking, the same ideas that brought us here will do the trick again.

What will this mean for academic research?

After thinking about it for the best part of this year, I would like to make the point that DeepMind’s achievement may become an indictment of academia.

Everyone speaks about the many flaws of academia. If you follow the Twitter banter, it seems that academics mostly talk about how much they dislike being academics. We have an unhealthy work culture, rampant inequality, and work conditions that sometimes defy reason. Even the long-lauded job benefits like “a job for life, once you are a professor” or “you can research what you truly want” are mostly gone amidst the increasing mercantilism of higher education. There have been many attempts to change this system, but ultimately, none of them seem to have worked.

I think there is something that may pose a true challenge to the system: market forces.

Academia has long been the place to carry out basic research. Sure, there have been a few industrial labs engaging in “blue skies” projects all the way since Bell Labs, but these have been the exception rather than the norm. For a long time, if you wanted to solve “basic research” problems, like protein structure prediction, the path was clear: find a job within the academic system. Fast forward to 2021, and we find that some of the best basic research in computational biology, and several other fields, is carried out by the R&D branches of FAAMG companies. Not only that, but the work conditions are much better.

What argument is there for a young researcher to embark on the path towards academia then? How do we convince someone to endure years of temporary contracts, bad pay, non-existent work-life balance and general instability, only to discover that researchers in industrial labs, who enjoy all the perks they would sacrifice “for science”, are also the only ones with the right resources to carry out the research they really want to do? You may argue that this conundrum has existed for a long time. But the difference is that, if previously industry was a good place to do “applied” research, it may soon become the place to do all research. And that could deal an enormous blow to the academic system.

This perspective should set off blaring alarms at every government institution. Think of the COVID-19 pandemic, for example. In the UK, the government’s egregious incompetence has only been tempered by a council of leading scientists providing the best evidence-based advice available. This conclave only works because you can find some of the best scientists housed in the higher education system, not representing any private interests. But what will happen if the best and the brightest end up snatched away by industry? You simply cannot assemble a similar council of Big Pharma (or Big AI?) executives and hope they have no personal interests. It also doesn’t work if your advisors are not the best and the brightest, or if they have fallen behind industry because they don’t have even a fraction of the resources needed to catch up.

Another problem: academia has traditionally been the breeding ground for defining new problems. Think of protein structure prediction. It is not a priori obvious how to define the problem, or how to assess it, and it took protein scientists many years to come up with CASP. All of this work defining the problem and developing a critical evaluation was necessary for DeepMind to arrive with fresh ideas and strong computational muscle, and solve the problem. However, what happens if academia loses its lustre? Who will want to come up with new problems, in full knowledge that industry can step in any minute and find a solution before their very eyes? I suspect very few.

I am growing quite dramatic, I know. But the issue still stands. There are many questions to be answered by both academic institutions and funding agencies about how to make academia more competitive and up to date. It has nothing to do with where you came from: it is about bringing equilibrium to the system, letting industry benefit from academia and vice versa.

What will this mean for biology?

When I wrote about this question eight months ago, my response was conditional on whether DeepMind would release the code, and whether this code would be in a format that allowed wide use of it. We now have affirmative responses to both questions, so we should discuss the question in more depth.

The release of AlphaFold 2 means that predicting a protein structure from sequence will be, for all practical purposes, a solved problem. Sure, the predictions will not be perfect. For some families of proteins, they will be pretty bad. But, all in all, when a researcher identifies a protein sequence of interest, they should be able to obtain some structural information in a matter of days, if not hours. This is an incredible win for researchers in biology, who will no longer need years and millions of dollars to understand the structures of their proteins, but maybe just the click of a button. Who knows, in time it might even be implemented in the NCBI website, alongside other bioinformatics tools, like BLAST, that are commonly used by computational and experimental biologists alike.

The most obvious direct application of this project is structure-based drug discovery. Until now, the availability of structures was a prime requirement: most people would not even consider starting a project without at least a crystal structure of the unbound protein. However, the availability of high-accuracy predictions, as well as predicted “error bars” will probably encourage the pharmaceutical industry to increasingly use AlphaFold 2’s models for development. And so, we may soon have inhibitors of many drug targets that have remained hitherto unexplored.

Another problem that will receive increased attention is that of protein design. In order to devise a protein with a specific function, it is necessary to ensure that it folds closely to a particular structure. So far, this process has been slowed down by the time it takes to complete a cycle of design, expression and structure determination. However, if our protein structure prediction pipelines are good enough to establish the topology of a protein without experimental confirmation, this might accelerate the testing cycle. This might lead to novel, artificial proteins, capable of solving a number of fascinating challenges.

Closer to my own field, I foresee structural bioinformatics research finally breaking free of the protein structure prediction problem. We will now dedicate our efforts to other, much more interesting problems that until now may not have received enough attention from our community. I can identify two potential candidates that will keep us busy for some time. One of them is protein dynamics, and all of the phenomena that are related to it: folding and misfolding, aggregation, allostery, flexibility, fold-switching and the like. The other one is binding: to ligands, as in drug discovery, but also between proteins.

As we further our understanding of these phenomena, we will bring the field towards a “structural systems biology”. In this potential future, we will be able to directly model interactions between proteins in the context of the cell, and predict the phenotypic effects of changes in the proteins and the media. This would enable us to understand many diseases whose mechanisms are still unclear — such as Alzheimer’s or Parkinson’s disease, some of the most common proteinopathies.

And, as always happens, the discovery of new tools will bring forward novel problems, and novel solutions, that we have not considered yet. As cheap and accurate structure prediction enables structural annotation of massive databases of protein sequences, we may discover new facts about biology, or new tools to explore it, that we have not considered before.

Conclusions

Weirdly, I find myself having pretty much the same conclusions that I arrived at eight months ago. We know that AlphaFold 2 “works”, within some limitations, and that its release to the community will stimulate tons of novel research in biology. We know that the success of DeepMind speaks volumes about the way research in the field has been carried out. And we know that we need to reflect carefully about the way we fund, conduct and motivate research. But these are the same things we have been thinking about for several months, so let us talk about some other ones.

The AlphaFold 2 system is a wonder of modern deep learning, including a variety of highly sophisticated components. One of these components, the Evoformer, is able to efficiently extract information from a multiple sequence alignment and build an accurate representation of the parts of the protein in close contact. Another piece, the Structure Module, takes this representation and builds a three-dimensional structure for the protein, including the position of the side chains. Together with a plethora of deep learning tricks, the model produces breathtakingly accurate predictions.

Although all of the ideas in the model are doubtlessly clever, the main secret behind AlphaFold 2’s success is the superb deep learning engineering. A close look at the model reveals an architecture with a large number of small details that seem fundamental to the performance of the network. As we admire the end product, we should not turn a blind eye to the enormous budget, and the large team of full-time, handsomely paid engineers, that made it possible.

The code is publicly available, and it can be run with moderate computational resources. The predictions seem to agree with the paper, and are reasonable even for some complicated cases, like designed proteins and antibodies, where multiple sequence alignments are not necessarily informative. This success will allow researchers in biology to obtain structural information about their proteins almost immediately, and spearhead significant advances in multiple areas of molecular biology. It is a very exciting time to be a computational biologist, I’ll tell you that.

I would like to thank Olly M. Crook for providing extensive comments on the first draft of this text.
