Monthly Archives: April 2024

Dockerized Colabfold for large-scale batch predictions

Alphafold is great, however it’s not suited for large batch predictions for 2 main reasons. Firstly, there is no native functionality for predicting structures off multiple fasta sequences (although a custom batch prediction script can be written pretty easily). Secondly, the multiple sequence alignment (MSA) step is heavy and running MSAs for, say, 10,000 sequences at a tractable speed requires some serious hardware.

Fortunately, an alternative to Alphafold has been released and is now widely used; Colabfold. For many, Colabfold’s primary strength is being cloud-based and that prediction requests can be submitted on Google Colab, thereby being extremely user-friendly by avoiding local installations. However, I would argue the greatest value Colabfold brings is a massive MSA speed up (40-60 fold) by replacing HHBlits and BLAST with MMseq2. This, and the fact batches of sequences can be natively processed facilitates a realistic option for predicting thousands of structures (this could still take days on a pair of v100s depending on sequence length etc, but its workable).

In my opinion the cleanest local installation and simplest usage of Colabfold is via Docker containers, for which both a Dockerfile and pre-built docker image have been released. Unfortunately, the Docker image does not come packaged with the necessary setup_databases.sh script, which is required to build a local sequence database. By default the MSAs are run on the Colabfold public server, which is a shared resource and can only process a total of a few thousand MSAs per day.

The following accordingly outlines preparatory steps for 100% local, batch predictions (setting up the database can in theory be done in 1 line via a mount, but I was getting a weird wget permissions error so have broken it up to first fetch the file on the local):

Pull the relevant colabfold docker image (container registry):

docker pull ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2

Create a cache to store weights:

mkdir cache

Download the model weights:

docker run -ti --rm -v path/to/cache:/cache ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 python -m colabfold.download

Fetch the setup_databases.sh script

wget https://github.com/sokrypton/ColabFold/blob/main/setup_databases.sh 

Spin up a container. The container will exit as soon as the first command is run, so we need to be a bit hacky by running an infinite command in the background:

CONTAINER_ID=$(docker run -d ghcr.io/sokrypton/colabfold:1.5.5 cuda12.2.2 /bin/bash -c "tail -f /dev/null")

Copy the setup_databases.sh script to the relevant path in the container and create a databases directory:

docker cp ./setup_databases.sh $CONTAINER_ID:/usr/local/envs/colabfold/bin/ 
docker exec $CONTAINER_ID mkdir /databases

Run the setup script. This will download and prepare the databases (~2TB once extracted):

docker exec $CONTAINER_ID /usr/local/envs/colabfold/bin/setup_databases.sh /databases/ 

Copy the databases back to the host and clean up:

docker cp $CONTAINER_ID:/databases ./ 
docker stop $CONTAINER_ID
docker rm $CONTAINER_ID

You should now be at a stage where batch predictions can be run, for which I have provided a template script (uses a fasta file with multiple sequences) below. It’s worth noting that maximum search speeds can be achieved by loading the database into memory and pre-indexing, but this requires about 1TB of RAM, which I don’t have.

There are 2 key processes that I prefer to log separately, colabfold_search and colabfold_batch:

#!/bin/bash

# Define the paths for database, input FASTA, and outputs

db_path="path/to/database"
input_fasta="path/to/fasta/file.fasta"
output_path="path/to/output/directory"
log_path="path/to/logs/directory"
cache_path="path/to/weights/cache"

# Run Docker container to execute colabfold_search and colabfold_batch 

time docker run --gpus all -v "${db_path}:/database" -v "${input_fasta}:/input.fasta" -v "${output_path}:/predictions" -v "${log_path}:/logs" -v "${cache_path}:/cache"
 ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 /bin/bash -c "colabfold_search --mmseqs /usr/local/envs/colabfold/bin/mmseqs /input.fasta /database msas > /logs/search.log 2>&1 && colabfold_batch msas /predictions > /logs/batch.log 2>&1"

Tracking the change in ML performance for popular small molecule benchmarks

The power of machine learning (ML) techniques has captivated the field of small molecule drug discovery. Increasingly, researchers and organisations have employed ML to create more accurate algorithms to improve the efficiency of the discovery process.

To be published, methods have to prove they have improved upon others. Often, methods are tested against the same benchmarks within a field, allowing us to track progress over time. To explore the rate of improvement, I curated the performance on three popular benchmarks. The first benchmark is CASF 2016, used to test the accuracy of methods that predict the binding affinity of experimental determined protein-ligand complexes. Accuracy was measured using the Pearson’s R value between predicted and experimental affinity values.

Continue reading

RSC Fragments 2024

I attended RSC Fragments 2024 (Hinxton, 4–5 March 2024), a conference dedicated to fragment-based drug discovery. The various talks were really good, because they gave overviews of projects involving teams across long stretches of time. As a result there were no slides discussing wet lab protocol optimisations and not a single Western blot was seen. The focus was primarily either illustrating a discovery platform or recounting a declassified campaign. The latter were interesting, although I’d admit I wish there had been more talk of organic chemistry —there was not a single moan/gloat about a yield. This top-down focus was nice as topics kept overlapping, namely:

  • Target choice,
  • covalents,
  • molecular glues,
  • whether to escape Flatland,
  • thermodynamics, and
  • cryptic pockets
Continue reading