Author Archives: Carlos Outeiral Rubiera

AlphaGeometry: are computers taking over math?

Last week, Google DeepMind announced AlphaGeometry, a novel deep learning system that is able to solve geometry problems of the kind presented at the International Mathematical Olympiad (IMO). The work is described in a recent Nature paper, and is accompanied by a GitHub repo including full code and weights.

This paper has caused quite a stir in some circles. Well, at least the kind of circles that you tend to get in close contact with when you work at a Department of Statistics. Just as folks in structural biology wondered three years ago, those who earn a living by veering into the mathematical void and crafting proofs were wondering whether their jobs may also have a close-by expiration date. I found this quite interesting, so I decided to read the paper and try to understand it. To motivate myself, I set out to present this paper at an upcoming journal club, and also to write this blog post.

So, let’s ask, what has actually been achieved and how powerful is this model?

What has been achieved

The image that has been making the rounds this time is the following benchmark:


Understanding GPU parallelization in deep learning

Deep learning has proven to be the season’s favourite for biology: every other week, an interesting biological problem is solved by clever application of neural networks. Yet, as more challenges get cracked, modern research shifts more and more in the direction of larger models — meaning that increasing computational resources are required for training. Unsurprisingly, NVIDIA, the main manufacturer of GPUs, experienced a significant jump in their stock price earlier this year.

Access to compute is not enough to train good neural networks. As soon as multiple cards come into play, researchers need to use a completely different paradigm where data and model weights are distributed across different devices, and sometimes even different computers. Though these tools are becoming crucial for successful computational biology research, they are generally unknown to researchers. Hence, in this blog post, I would like to provide a really brief introduction to multi-GPU training.
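
To make the idea of distributing data across devices a little more concrete, here is a minimal sketch of data parallelism in plain Python. It is only an illustration under stated assumptions: no real GPUs are involved, each "device" is simulated by an ordinary function call, the model is a one-parameter linear fit, and all names (`local_gradient`, `all_reduce_mean`, `data_parallel_step`) are hypothetical rather than taken from any library. It does not cover splitting the model weights themselves.

```python
# Toy illustration of data parallelism; each "device" is just a function call.

def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the all-reduce step: average the per-device gradients."""
    return sum(values) / len(values)

def data_parallel_step(w, batch, n_devices, lr):
    """One gradient step with the batch split evenly across n_devices replicas."""
    shard_size = len(batch) // n_devices  # assumes the batch divides evenly
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(n_devices)]
    # On real hardware, each replica would compute its gradient concurrently.
    grads = [local_gradient(w, shard) for shard in shards]
    # Every replica applies the same averaged gradient, so the weights stay in sync.
    return w - lr * all_reduce_mean(grads)

# Hypothetical data drawn from y = 3x.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w_parallel = data_parallel_step(0.0, batch, n_devices=4, lr=0.01)
w_single = data_parallel_step(0.0, batch, n_devices=1, lr=0.01)
print(w_parallel, w_single)
```

With equally sized shards, averaging the per-shard gradients gives the same update as a single device processing the whole batch, which is why the two results agree here. Real frameworks add the communication machinery that this sketch leaves out.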


How ChatGPT changed my writing as an ESL speaker

It’s not always easy to live in an Anglophone scientific world when English isn’t your first language. When careers are built upon the ability to communicate ideas clearly and eloquently, struggling to find the right words can be a real hindrance to explaining your science in a way that is taken seriously. Contrary to popular belief, it’s not something you can simply “work” on. Often, it doesn’t matter how many books you’ve read, how many years of education you have, or how articulate you are in your original language: your brain will refuse to summon the right expression, or get stuck in a construction that a native speaker would never use. Struggling with a second language is very much a biological phenomenon.

The standard recommendation for ESL (English as a Second Language) speakers has long been to ask a native colleague to read through any text that needs to be published or submitted somewhere (such as an article or a grant application). Well-intentioned as this advice may be, there are multiple problems with it. Lingua franca or not, only 15% of the world population speaks English, of which only 5% are native speakers, meaning that for most scientists not working in Anglophone countries, the option is rarely available. Even when available, it is unreasonable to expect these colleagues to add charitable proof-reading to their workload simply because they happened to be born speaking a different language. But, most importantly, I have always felt (and I want to emphasize that I truly believe most people who issue this kind of advice to be well-intentioned) that the underlying message sounds too much like “you need vetting by a member of our select linguistic club if you want your ideas to be taken seriously”.


Tales of an OPIG Jamboree

Jamboree
(1) a large gathering, as of a political party or the teams of a sporting league, often including a program of speeches and entertainment;
(2) a large gathering of members of the Boy Scouts or Girl Scouts, usually nationwide or international in scope

Oxford Dictionary

This October marks twenty years since our supreme leader, Charlotte Deane, came to Oxford to start the first protein informatics group in this university.

Twenty years is a really long time, and at OPIG we like to celebrate things in style. From the beginning, it was clear that we would be doing what we know best: get together, consume lots of food and drinks, and perhaps talk about science. But, frankly, that’s what we do all the time. This simply wasn’t enough to celebrate two decades of scientific production. So Charlotte entrusted several of us with an ambitious goal: to reach out to our former members, and to ask them to join us, in Oxford, to celebrate two decades of protein informatics. And that’s what we did.

For two months, we painstakingly tracked down every person who has ever been part of our group, and attempted to gather their contact details to invite them to Oxford. Attempted to, for the most part. While LinkedIn gave us some early victories, some alumni had managed to cover their tracks very well, including one person we could only find after tracking down their three previous jobs. Nevertheless, after much digging, we managed to find updated contact details for every person who has ever passed through our lab, and nearly thirty of these alumni (almost 50% of them!) made their way to Oxford on October 8th* to hold the first OPIG Jamboree.

From the first student (Sanne Abeln, rightmost in the second row) to the most recent (Kate, whose hair can barely be seen on the leftmost third row), we are all here!

The evolution, evolvability and engineering of gene regulatory DNA

Catching up on the literature is one of the highlights of my job as a scientist. True, sometimes you can be overwhelmed by the amount of information you don’t have; or wonder if we really need another paper showing that protein-ligand scoring functions don’t work. And yet, sometimes you find excellent research that you can’t but regard with a mixture of awe and envy. At a recent group meeting, I discussed one such paper from the research group of Aviv Regev at MIT, where the authors perform an impressive combination of computation and experiment to consider some basic questions in gene regulation and evolution. Here is why I think it’s excellent.

The authors are interested in promoters, small sequences of DNA that precede genes, which are known to regulate how frequently their partners will be expressed. In short, these promoters are binding sites for transcription factors, a family of proteins that in turn recruit RNA polymerase to transcribe DNA to RNA. In turn, albeit not directly, the rate of gene transcription determines the rate at which a protein is produced. If this sounds simple, however, that is where our understanding stops. The human genome encodes some 1.6k different transcription factors (~6-7% of protein-coding genes) and their inner workings are still not well understood.


Make your code do more, with less

When you wrangle data for a living, you start to wonder why everything takes so darn long. Through five years of introspection, I have come to conclude that two simple factors limit every computational project. One is, of course, your personal productivity. Your time of focused work, minus distractions (and yes, meetings figure here), times your energy and mental acuity. All those things you have little control over, unfortunately. But the second is the productivity of your code and tools. And this, in principle, is a variable that you have full control over.

Even quick calculations, when applied to tens of millions of sequences, can take quite some time!

This is a post about how to increase your productivity, by helping you navigate all those instances when the progress bar does not seem to go fast enough. I want to discuss actionable tools to make your code run faster, and generate more results, with less effort, in less time. Instructions to tinker less and think more, so you can do the science that you truly want to be doing. And, above all, I want to give out advice that is so counter-intuitive that you should absolutely consider following it.


How to estimate the inestimable

Back-of-the-envelope calculations are one of our chief tools as scientists. When you spend most of your time wondering if your latest measurement is correct, having a tool to check if the numbers make sense is simply priceless. If you are lucky, a good estimate might just avoid a costly or laborious measurement — this is very common in disciplines like chemical engineering, which a friend described as “the art of estimating numbers and plugging them into some variation of Bernoulli’s continuity equation”. Unsurprisingly, these Fermi problems are now common interview questions at major consultancy and tech companies, and have even started to go viral.

Last week, I thought I would ask my biochemistry students to solve a back-of-the-envelope problem as part of their tutorial work. In a question disguised as an enzyme catalysis problem, I asked them to estimate the energy of a single hydrogen bond. Needless to say, they were puzzled. Some of them asked if I had forgotten to include some information in the problem sheet. For some reason, Fermi problems seem to be less common in chemistry and biology than they are in physics or engineering. Of course, estimating the energy of a hydrogen bond is in many ways much harder than guessing the number of ping pong balls that fit in a Boeing 747. Nobody has seen a hydrogen bond in the flesh. And our minds struggle to grasp the vast numbers present at the molecular level. Nevertheless, guesstimates are incredibly useful


AlphaFold 2 is here: what’s behind the structure prediction miracle

Nature has now published the AlphaFold 2 paper, after eight long months of waiting. The main text reports more or less what we have known for nearly a year, with some added tidbits, although it is accompanied by a painstaking description of the architecture in the supplementary information. Perhaps more importantly, the authors have released the entirety of the code, including all details to run the pipeline, on GitHub. And there is no small print this time: you can run inference on any protein (I’ve checked!).

Have you not heard the news? Let me refresh your memory. In November 2020, a team of AI scientists from Google DeepMind indisputably won the 14th Critical Assessment of Structure Prediction competition, a biennial blind test where computational biologists try to predict the structure of several proteins whose structure has been determined experimentally but not publicly released. Their results were so astounding, and the problem so central to biology, that it took the entire world by surprise and left an entire discipline, computational biology, wondering what had just happened.


Singularity: a guide for the bewildered bioinformatician

Have you ever worked with a piece of software that is awfully difficult to set up? That legacy code written in FORTRAN 77, that other one that requires significant modifications to compile, or any of those that require a long-winded bash script with a thousand dependencies (which you also have to install!). Would it not be helpful if, when that red-eyed PhD student, that one that just spent three months writing up their thesis, says that they absolutely must use that server where you have installed all your stuff, you could just relocate to another one without trouble? Well, you may be able to do that now. You just need to use containerization.

The idea behind containerization is rather simple. The best way to ensure anyone can reproduce your work is to, well, ship your entire system to whoever needs to use it. You could, for example, pack up your desktop in a box, and ship it to your collaborators anywhere in the world. Unfortunately, this idea is quite impractical, not only because of tedious logistics (ever had to deal with customs?), but also because suddenly you won’t be able to run your own pipeline. However, it is a good enough thought that at some point it made a clever engineer wonder whether there was a way to ship an entire system without physically delivering the computer. And that’s exactly what they designed.

Best way to make sure your collaborators on the other side of the world can run your pipeline — just pack your desktop in one of these, and ship it away!

CASP14: what Google DeepMind’s AlphaFold 2 really achieved, and what it means for protein folding, biology and bioinformatics

Disclaimer: this post is an opinion piece based on the experience and opinions derived from attending the CASP14 conference as a doctoral student researching protein modelling. When provided, quotes have been extracted from my notes of the event, and while I hope to have captured them as accurately as possible, I cannot guarantee that they are a word-for-word facsimile of what the individuals said. Neither the Oxford Protein Informatics Group nor I accept any responsibility for the content of this post.

You might have heard it from the scientific or regular press, perhaps even from DeepMind’s own blog. Google’s AlphaFold 2 indisputably won the 14th Critical Assessment of Structure Prediction competition, a biennial blind test where computational biologists try to predict the structure of several proteins whose structure has been determined experimentally, yet not publicly released. Their results are so incredibly accurate that many have hailed this code as the solution to the long-standing protein structure prediction problem.
