Dockerized ColabFold for large-scale batch predictions

AlphaFold is great, but it is not suited to large batch predictions, for two main reasons. First, there is no native functionality for predicting structures from multiple FASTA sequences (although a custom batch prediction script can be written fairly easily). Second, the multiple sequence alignment (MSA) step is heavy, and running MSAs for, say, 10,000 sequences at a tractable speed requires some serious hardware.

Fortunately, an alternative to AlphaFold has been released and is now widely used: ColabFold. For many, ColabFold’s primary strength is that it is cloud-based: prediction requests can be submitted on Google Colab, which makes it extremely user-friendly by avoiding local installations. However, I would argue that the greatest value ColabFold brings is a massive MSA speed-up (40-60 fold) from replacing HHblits and jackhmmer with MMseqs2. This, together with the fact that batches of sequences can be processed natively, makes predicting thousands of structures a realistic option (this could still take days on a pair of V100s depending on sequence length etc., but it’s workable).

In my opinion, the cleanest local installation and simplest usage of ColabFold is via Docker containers, for which both a Dockerfile and a pre-built Docker image have been released. Unfortunately, the Docker image does not come packaged with the setup_databases.sh script, which is required to build a local sequence database. Without a local database, the MSAs are by default run on the ColabFold public server, which is a shared resource and can only process a total of a few thousand MSAs per day.

The following outlines the preparatory steps for 100% local batch predictions (setting up the database can in theory be done in one line via a mount, but I was getting a weird wget permissions error, so I have broken it up to first fetch the file on the host):

Pull the relevant ColabFold Docker image from the GitHub container registry:

docker pull ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2

Create a cache to store weights:

mkdir cache

Download the model weights into the cache (Docker bind mounts need an absolute host path, hence $(pwd)):

docker run -ti --rm -v "$(pwd)/cache:/cache" ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 python -m colabfold.download

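Fetch the setup_databases.sh script from the raw file URL (the regular GitHub page address returns an HTML page rather than the script itself) and make it executable:

wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
chmod +x setup_databases.sh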
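A quick sanity check on the downloaded file can catch the case where wget has saved a web page instead of the script. The sketch below is only illustrative: the helper name looks_like_script is made up here, and it assumes the real script starts with a shebang line (#!), which an HTML page would not.

```shell
# Hypothetical helper: succeeds if the file exists and its first line is a shebang.
# Assumption: the genuine setup script begins with "#!"; an HTML page will not.
looks_like_script() (
  [ -f "$1" ] || exit 1
  first_line=
  # Read only the first line of the file
  IFS= read -r first_line < "$1" || true
  case "$first_line" in
    '#!'*) exit 0 ;;
    *) exit 1 ;;
  esac
)

# Example usage (not run here):
#   looks_like_script ./setup_databases.sh || echo "download does not look like a script" >&2
```

If the check fails, opening the file in a text editor will usually show whether it is HTML.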

Spin up a container. A container exits as soon as its main command finishes, so we need to be a bit hacky and keep it alive in the background with a command that never returns:

CONTAINER_ID=$(docker run -d ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 /bin/bash -c "tail -f /dev/null")

Copy the setup_databases.sh script to the relevant path in the container and create a databases directory:

docker cp ./setup_databases.sh $CONTAINER_ID:/usr/local/envs/colabfold/bin/ 
docker exec $CONTAINER_ID mkdir /databases

Run the setup script. This will download and prepare the databases (~2TB once extracted):

docker exec $CONTAINER_ID /usr/local/envs/colabfold/bin/setup_databases.sh /databases/ 
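Since no volume is mounted at /databases, this step writes into the container's writable layer, which with a default Docker setup lives under Docker's data root on the host. The later copy back to the host then needs a similar amount of space again. A rough pre-flight check, sketched here with made-up helper names (free_gib, has_free_gib) and an illustrative 2048 GiB threshold taken from the ~2TB figure above, might look like this:

```shell
# Hypothetical helper: free space, in whole GiB, on the filesystem holding a directory.
free_gib() {
  # -P forces one line per filesystem; column 4 is the available space in KiB
  df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Hypothetical helper: succeeds if at least $2 GiB are free on the filesystem holding $1.
has_free_gib() {
  [ "$(free_gib "$1")" -ge "$2" ]
}

# Example usage (not run here; 2048 GiB is an assumed threshold):
#   docker_root=$(docker info --format '{{.DockerRootDir}}')
#   has_free_gib "$docker_root" 2048 || echo "Docker data root may be too small" >&2
#   has_free_gib "$PWD" 2048 || echo "Working directory may be too small for the copy" >&2
```

If the Docker data root and the working directory sit on the same filesystem, both copies compete for the same space until the container is removed.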

Copy the databases back to the host and clean up:

docker cp $CONTAINER_ID:/databases ./ 
docker stop $CONTAINER_ID
docker rm $CONTAINER_ID

You should now be at a stage where batch predictions can be run, for which I have provided a template script (uses a fasta file with multiple sequences) below. It’s worth noting that maximum search speeds can be achieved by loading the database into memory and pre-indexing, but this requires about 1TB of RAM, which I don’t have.

There are 2 key processes that I prefer to log separately, colabfold_search and colabfold_batch:

#!/bin/bash

# Define the paths for the database, input FASTA, outputs, logs and weights cache.
# Docker bind mounts (-v) need absolute host paths.

db_path="/path/to/database"
input_fasta="/path/to/fasta/file.fasta"
output_path="/path/to/output/directory"
log_path="/path/to/logs/directory"
cache_path="/path/to/weights/cache"

# Run colabfold_search (MSAs) followed by colabfold_batch (structure prediction)
# in a single container, logging each step to its own file under /logs

time docker run --gpus all \
  -v "${db_path}:/database" \
  -v "${input_fasta}:/input.fasta" \
  -v "${output_path}:/predictions" \
  -v "${log_path}:/logs" \
  -v "${cache_path}:/cache" \
  ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 \
  /bin/bash -c "colabfold_search --mmseqs /usr/local/envs/colabfold/bin/mmseqs /input.fasta /database msas > /logs/search.log 2>&1 && colabfold_batch msas /predictions > /logs/batch.log 2>&1"
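Before launching a run that may take days, a couple of cheap checks on the inputs can save a restart. The helpers below are a minimal sketch with made-up names (count_fasta_records, is_absolute_path): one counts the sequences in the FASTA file so the job size is known up front, the other checks that a path is absolute, as Docker bind mounts expect.

```shell
# Hypothetical helper: count FASTA records (header lines start with '>').
count_fasta_records() {
  grep -c '^>' "$1"
}

# Hypothetical helper: succeeds if the path is absolute.
is_absolute_path() {
  case "$1" in
    /*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage (not run here; variable names follow the template script):
#   for p in "$db_path" "$input_fasta" "$output_path" "$log_path" "$cache_path"; do
#     is_absolute_path "$p" || echo "not an absolute path: $p" >&2
#   done
#   echo "Sequences to predict: $(count_fasta_records "$input_fasta")"
```

Note that grep -c still prints 0 for a file without headers but returns a non-zero exit status, which matters if the script runs with set -e.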

Tracking the change in ML performance for popular small molecule benchmarks

The power of machine learning (ML) techniques has captivated the field of small molecule drug discovery. Increasingly, researchers and organisations have employed ML to create more accurate algorithms to improve the efficiency of the discovery process.

To be published, methods have to show that they improve upon others. Methods within a field are often tested against the same benchmarks, which allows us to track progress over time. To explore the rate of improvement, I curated the performance on three popular benchmarks. The first benchmark is CASF 2016, used to test the accuracy of methods that predict the binding affinity of experimentally determined protein-ligand complexes. Accuracy was measured using the Pearson’s R value between predicted and experimental affinity values.


RSC Fragments 2024

I attended RSC Fragments 2024 (Hinxton, 4–5 March 2024), a conference dedicated to fragment-based drug discovery. The talks were really good because they gave overviews of projects involving teams across long stretches of time. As a result, there were no slides discussing wet lab protocol optimisations and not a single Western blot was seen. The focus was primarily on either illustrating a discovery platform or recounting a declassified campaign. The latter were interesting, although I’ll admit I wish there had been more talk of organic chemistry: there was not a single moan/gloat about a yield. This top-down focus was nice as topics kept overlapping, namely:

  • target choice,
  • covalents,
  • molecular glues,
  • whether to escape Flatland,
  • thermodynamics, and
  • cryptic pockets

Under-rated or overlooked, these libraries might be helpful.

Discovering a library that massively simplifies the exact thing you just did, right after you’ve finished doing it, has to be one of the top 14 worst things about writing code. You might think it’s a part of the life we’ve all chosen, but it doesn’t have to be. Beyond the popular libraries you already know lies a treasure trove of underappreciated packages waiting to be wielded. Being the saint I am, I’ve scoured the depths of pypi.org to find some underrated and hopefully useful packages to make your life a little easier.


Pitfalls of using Pearson’s correlation for comparing model performance

Pearson’s R (correlation coefficient) is a measure of the linear correlation between two variables, giving a value between -1 and 1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation. While it’s a useful statistic for understanding the relationship between two variables, it is often used to compare the performance of two or more models. For example, imagine we had experimental values that we are predicting and several models’ predictions. Obviously, we would prefer the model with the highest Pearson’s R … or perhaps not?
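For reference, the standard sample definition for paired values can be written as below, with x_i taken to be the predictions and y_i the experimental values (this notation is an assumption, not taken from the post). The second line states a general property of the coefficient, given here as background rather than as the post's own argument: R is unchanged when the predictions are rescaled by a positive factor a and shifted by b, so a model with a large systematic offset can score the same R as one without it.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

r_{(ax + b)\,y} = r_{xy} \quad \text{for any } a > 0 \text{ and any } b
```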


Open Source PyMOL installation on Windows

A year ago, I used Gheorghe Rotaru’s helpful blog post to install PyMOL. Unfortunately, after resetting my computer, I have just discovered that some of the links are broken. Here are the installation steps with new links provided by Christoph Gohlke, who generously offers pre-compiled Windows versions of the latest PyMOL software along with all its requirements.

Install the latest version of Python 3 for Windows:
Download the Windows Installer (x-bit) for Python 3 from their website, with x being your Windows architecture – 32 or 64.

Follow the instructions provided on how to install Python. You can confirm the installation by running ‘py’ in PowerShell.


An Open-Source CUDA for AMD GPUs – ZLUDA

Lots of work has been put into making AMD-designed GPUs work nicely with GPU-accelerated frameworks like PyTorch. Despite this, getting performant code on non-NVIDIA graphics cards can be challenging for both users and developers. Even where the developer has appropriately optimised for each platform, there are often gaps in performance because, at the driver level, instructions to the GPU may not be fully optimised. This is because software developed using CUDA can, in many cases, benefit from optimisations like operation fusing without the developer having to specify them.

This may not be much of a concern for most researchers, as we simply use what is available to us. Most of the time this is NVIDIA GPUs, and there is hardly a choice in the matter. NVIDIA is aware of this and prices its products accordingly. Part of the problem is that system designers just don’t have an incentive to build AMD platforms other than for highly specialised machines.


Optimising for PR AUC vs ROC AUC – an intuitive understanding

When training a machine learning (ML) model, our main aim is usually to get the ‘best’ model out the other end in an unbiased manner. Of course, there are other considerations such as quick training and inference, but mostly we want to be good at predicting the right answer.

A number of factors will affect the quality of our final model, including the chosen architecture, optimiser, and – importantly – the metric we are optimising for. So, how should we pick this metric?


3 approaches to linear-memory Transformers

Transformers are a very popular architecture for processing sequential data, notably text and (our interest) proteins. Transformers learn more complex patterns with larger models on more data, as demonstrated by models like GPT-4 and ESM-2. Transformers work by updating tokens according to an attention value computed as a weighted sum of all other tokens. In standard implementations this requires computing the product of a query and key matrix, which requires O(N²d) computations and, problematically, O(N²) memory for a sequence of length N and an embedding size of d. To speed up Transformers, and to analyze longer sequences, several variants have been proposed which require only O(N) memory. Broadly, these can be divided into sparse methods, softmax-approximators, and memory-efficient Transformers.
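To make the quadratic term concrete, the standard scaled dot-product formulation is sketched below. The symbols Q, K and V (query, key and value matrices) and the scaling by the square root of d are the usual textbook notation and are assumed here rather than taken from the post. The N × N score matrix is the object that costs O(N²) memory when it is materialised in full.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad Q, K, V \in \mathbb{R}^{N \times d},
\qquad Q K^{\top} \in \mathbb{R}^{N \times N}
```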


Fail fast

While scrolling through my Instagram reels feed, I came across a reel of Jensen Huang, NVIDIA’s CEO, talking about the need to fail fast, which motivated me to write a post. ‘Fail fast’ is a piece of advice I have been hearing since I embarked on my PhD: fail fast on the research directions we plan to pursue, so that we understand the difficulties and limitations of the research problems and methods used, which in turn gives us more time to fine-tune our problem and develop more nuanced approaches. Since childhood, most of us have been taught that failures eventually lead to success and that persevering towards success is critical. However, one thing that I could not come to terms with is the narrative of several failures ‘magically’ leading to success. If you were destined to be successful, why would you even fail? And for every failure-to-success story we hear, there are many other stories of failure that we don’t.
