NeurIPS 2021 Conference Feedback

Held annually in December, the Neural Information Processing Systems (NeurIPS) meetings aim to encourage researchers applying machine learning techniques in their work – whether in economics, physics, or any number of other fields – to get together to discuss their findings, hear from world-leading experts, and, in many years past, ski. The virtual nature of this year's conference had an enormously negative impact on attendees' skiing experiences, but it was nevertheless a pleasure to attend – the Machine Learning in Structural Biology (MLSB) workshop, in particular, provided a useful overview of the hottest topics in the field and of the methods that people are using to tackle them.

This year's NeurIPS highlighted the growing interest in applying the newest Natural Language Processing (NLP) algorithms to proteins. This includes antibodies, as evidenced by two presentations in the MLSB workshop that focused on using these algorithms for the discovery and design of antibodies. Ruffolo et al. presented a BERT-inspired language model for antibodies. The purpose of such a model is to create representations that encapsulate the information in an antibody sequence, which can then be used to predict antibody properties. They showed how the representations could be used to predict high-redundancy sequences (a proxy for strong binders), and how continuous trajectories consistent with the number of mutations could be observed when applying UMAP to the representations. While such representations can be used to predict the properties of antibodies, another work, by Shuai et al., instead focused on training a generative language model for antibodies, able to generate a region of an antibody conditioned on the rest of the sequence. This could potentially be used to generate new viable CDR regions of variable length, more effectively than randomly mutating them.
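The core idea behind BERT-style training on antibody sequences is simple: hide a fraction of the residues and ask the model to reconstruct them from context. A minimal sketch of the masking step is below; the 15% fraction and the mask token follow the original BERT recipe, the example sequence is an arbitrary heavy-chain-like fragment, and everything here is illustrative rather than the authors' actual pipeline.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mask_sequence(seq, mask_frac=0.15, mask_token="#", rng=None):
    """BERT-style masking: hide a fraction of residues so a model can be
    trained to predict them from the surrounding context. The masked
    positions and their original residues become the training targets."""
    rng = rng or random.Random(0)
    seq = list(seq)
    n_mask = max(1, round(len(seq) * mask_frac))
    positions = rng.sample(range(len(seq)), n_mask)
    targets = {}
    for pos in positions:
        targets[pos] = seq[pos]   # residues the model must reconstruct
        seq[pos] = mask_token
    return "".join(seq), targets

# Toy heavy-chain-like fragment (illustrative, not from the paper)
masked, targets = mask_sequence("QVQLVQSGAEVKKPGASVKVSCKAS")
```

A generative infilling model in the spirit of Shuai et al. works analogously, except that the hidden span is an entire contiguous region (e.g. a CDR loop) and the model generates a replacement of possibly different length.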

These two methods are good examples of how the current wave of NLP methods can be used for both the discovery (i.e. finding binders in repertoire samples) and the design (i.e. generating new CDR loops) of therapeutic antibodies.

Another hot topic in AI for drug discovery, covered extensively at the conference, was the use of generative models to predict compounds that bind strongly to protein pockets. While the past five years have seen an explosion of generative models for this purpose, very few considered structural information about the protein binding site, and in none of them did the generative process occur in the context of the protein pocket. This limitation has been addressed in two 2021 NeurIPS papers that generate compounds and their coordinates simultaneously.
The first paper, by Luo S et al., proposed an auto-regressive model in which, given the atoms of the protein and the atoms of the generated molecule at step t (no molecule atoms at t=0), a Graph Neural Network (GNN) estimates the probability of an atom of type X being located at a particular 3D coordinate. From that probability density, the method samples an atom type and a 3D position, then continues estimating new probabilities and sampling new atoms until the GNN decides that there is no more room for atoms to be added. For training, the network was asked to regenerate bound molecules that had been masked, that is, molecules from which some atoms had been virtually removed.

The second paper, by Drotar P et al., employs a Variational Auto-Encoder (VAE) in which the encoder, a GNN, computes a latent representation of the molecule at step t, and the decoder determines the atom type, bond type, and coordinates of the next atom to be added. As in the first paper, the protein pocket is also taken into account by the system.
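The autoregressive control flow shared by both papers can be sketched in a few lines. In the sketch below, `sample_next_atom` is a toy stand-in for the learned GNN (it just samples uniformly near the pocket centre), and the stop condition is a fixed random signal rather than a learned one; only the loop structure reflects the methods described above.

```python
import random

ATOM_TYPES = ["C", "N", "O", "S"]

def sample_next_atom(pocket_atoms, placed_atoms, rng):
    """Toy stand-in for the GNN: the real model, conditioned on the pocket
    and the atoms placed so far, outputs a density over atom types and 3D
    positions. Here we sample uniformly near the pocket centroid purely to
    illustrate the control flow."""
    n = len(pocket_atoms)
    cx = sum(x for x, _, _ in pocket_atoms) / n
    cy = sum(y for _, y, _ in pocket_atoms) / n
    cz = sum(z for _, _, z in pocket_atoms) / n
    atom_type = rng.choice(ATOM_TYPES)
    coord = (cx + rng.uniform(-2, 2),
             cy + rng.uniform(-2, 2),
             cz + rng.uniform(-2, 2))
    return atom_type, coord

def generate_molecule(pocket_atoms, max_atoms=12, seed=0):
    """Autoregressive loop: start from an empty molecule (t=0) and add one
    atom at a time until a stop condition fires. The real model learns
    when there is no room left; here a toy random signal stops the loop."""
    rng = random.Random(seed)
    molecule = []
    while len(molecule) < max_atoms:
        atom_type, coord = sample_next_atom(pocket_atoms, molecule, rng)
        molecule.append((atom_type, coord))
        if rng.random() < 0.1:   # toy stand-in for the learned stop signal
            break
    return molecule

pocket = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
mol = generate_molecule(pocket)
```

The two papers differ mainly in how the per-step distribution is produced (direct density estimation versus decoding from a VAE latent), not in this overall loop.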

The two papers propose conceptually similar strategies, in which molecules are generated stepwise, and they report good performance on classical benchmarks such as DUD-E. However, given the well-known biases that small-molecule datasets tend to exhibit, the big question that remains is how well these methods will perform in the wild.

Another prevalent topic was the determination of protein atomic structures. Despite significant advances in high-resolution protein reconstruction, the majority of cryo-EM experiments only provide insight into a small number of conformations of the protein in question. One of the many presentations in the MLSB workshop offered a solution to this: a fully end-to-end deep learning system that reconstructs the atomic heterogeneity of proteins using variational auto-encoders.
In their paper, Rosenbaum et al. explain that the model was trained to capture the distribution of conformations in two stages. First, they modelled the full forward process that generates an image: a decoder takes latent variables encoding conformation and pose and maps them to atomic coordinates, from which an image is produced by a model that simulates the cryo-EM image acquisition process. Second, they trained an inverse model that predicts a posterior over the latent variables from a noisy EM image. Once the model is trained, the prior over the latent space matches the accumulated posterior from the training images; this prior can hence be used to learn about the distribution of protein conformations in atomic coordinate space. Overall, the success of this method is a promising sign that deep learning techniques can enable the reconstruction of atomic diversity directly from measured data, and it offers a way to directly combine experimental data with very strong priors from machine learning.
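The forward process (latent → atomic coordinates → simulated image) can be illustrated with a deliberately crude sketch. Here the "decoder" is just a helix whose pitch depends on a scalar latent, and image formation is an orthographic projection with Gaussian blur; in the actual model both pieces are learned or physically detailed simulators, so everything below is a toy stand-in for the structure of the computation.

```python
import math

def decode_conformation(z, n_atoms=8):
    """Toy decoder: maps a 1-D latent 'conformation' variable to 3D atom
    coordinates (a helix whose pitch depends on z). In the real model this
    is a learned network producing full atomic coordinates."""
    return [(math.cos(0.6 * i), math.sin(0.6 * i), z * 0.5 * i)
            for i in range(n_atoms)]

def render_image(coords, size=16, sigma=1.0):
    """Toy image-formation model: orthographic projection of atoms onto a
    2-D grid with Gaussian blur, a crude stand-in for the simulator of the
    cryo-EM image acquisition process."""
    img = [[0.0] * size for _ in range(size)]
    for x, y, _ in coords:                       # project along the z axis
        px, py = (x + 2) * size / 4, (y + 2) * size / 4
        for r in range(size):
            for c in range(size):
                d2 = (c - px) ** 2 + (r - py) ** 2
                img[r][c] += math.exp(-d2 / (2 * sigma ** 2))
    return img

image = render_image(decode_conformation(z=0.8))
```

The inverse model then runs in the opposite direction: from a noisy image like this one back to a posterior over the latent conformation and pose.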

Finally, in addition to a range of methodological advances, the increasing interest in applying machine learning algorithms to problems from the biomedical sciences is also reflected in the multiple frameworks published as part of the Datasets and Benchmarks track. As many datasets commonly employed to evaluate algorithmic advances have well-known flaws that limit their transferability to practical applications, composing and distributing ever more challenging benchmarks promises to accelerate the pace at which tangible improvements can be realised.

Representative examples include:
- The ATOM3D dataset, a collection of five existing and three novel datasets, highlighting the need for benchmarks that require models to reason over three-dimensional representations of different chemical and biological entities.
- The FLIP benchmark, a collection of three existing protein sequence datasets meant for evaluating models that infer different protein fitness landscapes.
- The Therapeutics Data Commons suite, which encompasses more than 60 existing datasets from different domains across the drug discovery pipeline, aiming to provide machine learning researchers with more realistic settings in which to evaluate new models.

All of these dataset collections include boilerplate PyTorch utilities for loading and splitting their data, and they represent a step in the right direction, aiming to concentrate the resources flowing into computer-aided drug design on problems of high practical relevance.
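For readers unfamiliar with what such splitting boilerplate looks like, here is a generic sketch in plain Python. It is not the API of ATOM3D, FLIP, or the Therapeutics Data Commons (each ships its own utilities, including harder splits such as scaffold- or sequence-identity-based ones); it only illustrates the basic reproducible train/validation/test split these packages provide out of the box.

```python
import random

def random_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Generic, seeded train/validation/test split. The benchmark suites
    discussed above ship utilities of this kind (plus domain-aware splits
    that are more challenging than a purely random one)."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    items = list(items)
    random.Random(seed).shuffle(items)   # fixed seed => reproducible split
    n = len(items)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = random_split(range(100))
```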

Written by Ruben, Tobias, Leo and Lucy
