On 5th April 2024, over 60 researchers braved the train strikes and gusty weather to gather at Lady Margaret Hall in Oxford and engage in a day full of scientific talks, posters and discussions on the topic of adaptive immune receptor (AIR) analysis!
Category Archives: Protein Structure
Dockerized Colabfold for large-scale batch predictions
AlphaFold is great, but it is not suited to large batch predictions, for 2 main reasons. Firstly, there is no native functionality for predicting structures from a FASTA file containing multiple sequences (although a custom batch prediction script can be written pretty easily). Secondly, the multiple sequence alignment (MSA) step is heavy, and running MSAs for, say, 10,000 sequences at a tractable speed requires some serious hardware.
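As a rough illustration of the "custom batch script" route, the sketch below splits a multi-sequence FASTA into one file per record, so that each file can be fed to a single-sequence predictor in a loop. The function name `split_fasta` and the `seq_00001.fasta` naming scheme are placeholders of my own, not anything AlphaFold provides.

```shell
#!/bin/bash
# split_fasta INPUT OUTDIR
# Hypothetical helper: writes each record of a multi-sequence FASTA file
# to its own numbered file (seq_00001.fasta, seq_00002.fasta, ...).
split_fasta() {
    local input_fasta="$1" out_dir="$2"
    mkdir -p "${out_dir}"
    awk -v dir="${out_dir}" '
        /^>/ {
            # A header line starts a new record: close the previous file
            # and open the next numbered one.
            if (out) close(out)
            n++
            out = sprintf("%s/seq_%05d.fasta", dir, n)
        }
        # Lines before the first header are ignored.
        out { print > out }
    ' "${input_fasta}"
}
```

Each resulting file can then be passed in turn to whatever single-sequence prediction command you normally run.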
Fortunately, an alternative to AlphaFold has been released and is now widely used: ColabFold. For many, ColabFold’s primary strength is that it is cloud-based: prediction requests can be submitted on Google Colab, which makes it extremely user-friendly by avoiding local installations. However, I would argue the greatest value ColabFold brings is a massive MSA speed-up (40-60 fold) from replacing the HHblits and jackhmmer searches with MMseqs2. This, and the fact that batches of sequences can be natively processed, makes predicting thousands of structures a realistic option (this could still take days on a pair of V100s depending on sequence length etc., but it’s workable).
In my opinion the cleanest local installation and simplest usage of ColabFold is via Docker containers, for which both a Dockerfile and a pre-built Docker image have been released. By default the MSAs are run on the ColabFold public server, which is a shared resource and can only process a total of a few thousand MSAs per day, so large batches need a local sequence database. Unfortunately, the Docker image does not come packaged with the setup_databases.sh script required to build one.
The following accordingly outlines preparatory steps for 100% local batch predictions (setting up the database can in theory be done in 1 line via a mount, but I was getting a weird wget permissions error, so I have broken it up to first fetch the file on the host):
Pull the relevant colabfold docker image (container registry):
docker pull ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2
Create a cache to store weights:
mkdir cache
Download the model weights:
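docker run -ti --rm -v "$(pwd)/cache:/cache" ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 python -m colabfold.download
Fetch the setup_databases.sh script (use the raw file URL, as the GitHub "blob" URL returns an HTML page) and make it executable:
wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
chmod +x setup_databases.sh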
Spin up a container. The container will exit as soon as the first command is run, so we need to be a bit hacky by running an infinite command in the background:
CONTAINER_ID=$(docker run -d ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 /bin/bash -c "tail -f /dev/null")
Copy the setup_databases.sh script to the relevant path in the container and create a databases directory:
docker cp ./setup_databases.sh $CONTAINER_ID:/usr/local/envs/colabfold/bin/
docker exec $CONTAINER_ID mkdir /databases
Run the setup script. This will download and prepare the databases (~2TB once extracted):
docker exec $CONTAINER_ID /usr/local/envs/colabfold/bin/setup_databases.sh /databases/
Copy the databases back to the host and clean up:
docker cp $CONTAINER_ID:/databases ./
docker stop $CONTAINER_ID
docker rm $CONTAINER_ID
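Two things can go wrong quietly in the steps above: the fetched setup_databases.sh may turn out to be an HTML page rather than a script, and the host may not have room for the roughly 2TB of databases. A minimal pre-flight sketch, to be run before copying the script into the container, could look like the following. Both function names are hypothetical helpers, and the shebang test assumes the real script starts with a `#!` line.

```shell
#!/bin/bash
# looks_like_shell_script FILE
# Hypothetical helper: succeeds only if FILE begins with a shebang ("#!"),
# which an HTML page downloaded by mistake would not.
looks_like_shell_script() {
    [ "$(head -c 2 "$1")" = "#!" ]
}

# has_free_space DIR REQUIRED_GB
# Hypothetical helper: succeeds if the filesystem holding DIR reports at
# least REQUIRED_GB gigabytes available (df -Pk prints 1K blocks, with
# the available space in column 4 of the second line).
has_free_space() {
    local dir="$1" required_gb="$2" available_kb
    available_kb=$(df -Pk "${dir}" | awk 'NR == 2 { print $4 }')
    [ "${available_kb}" -ge $(( required_gb * 1024 * 1024 )) ]
}
```

For example, `looks_like_shell_script ./setup_databases.sh && has_free_space . 2048 || echo "pre-flight check failed"` would flag either problem before any long download starts. Note that the download itself happens inside the container, so Docker's own storage location needs the same headroom.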
You should now be at a stage where batch predictions can be run, for which I have provided a template script (uses a fasta file with multiple sequences) below. It’s worth noting that maximum search speeds can be achieved by loading the database into memory and pre-indexing, but this requires about 1TB of RAM, which I don’t have.
There are 2 key processes that I prefer to log separately, colabfold_search and colabfold_batch:
#!/bin/bash
# Define the paths for database, input FASTA, and outputs
db_path="path/to/database"
input_fasta="path/to/fasta/file.fasta"
output_path="path/to/output/directory"
log_path="path/to/logs/directory"
cache_path="path/to/weights/cache"
# Run Docker container to execute colabfold_search and colabfold_batch
time docker run --gpus all \
    -v "${db_path}:/database" \
    -v "${input_fasta}:/input.fasta" \
    -v "${output_path}:/predictions" \
    -v "${log_path}:/logs" \
    -v "${cache_path}:/cache" \
    ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 \
    /bin/bash -c "colabfold_search --mmseqs /usr/local/envs/colabfold/bin/mmseqs /input.fasta /database msas > /logs/search.log 2>&1 && colabfold_batch msas /predictions > /logs/batch.log 2>&1"
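Since a run over thousands of sequences can take days, it may help to check progress from the host by comparing the number of input sequences with the number of structures written so far. The sketch below is one way to do that. The helper names are hypothetical, and the default file pattern `*_unrelaxed_rank_001_*.pdb` is an assumption about how colabfold_batch names its top-ranked models, so pass your own pattern if your output looks different.

```shell
#!/bin/bash
# count_fasta_records FILE
# Number of header lines (">") in a FASTA file, i.e. number of sequences.
count_fasta_records() {
    grep -c '^>' "$1"
}

# count_matching_files DIR PATTERN
# Number of regular files directly inside DIR whose name matches PATTERN.
count_matching_files() {
    find "$1" -maxdepth 1 -type f -name "$2" | wc -l | tr -d ' '
}

# report_progress FASTA OUTDIR [PATTERN]
# Hypothetical helper: prints "<finished>/<expected>" for a batch run.
# PATTERN defaults to an ASSUMED naming scheme for top-ranked models.
report_progress() {
    local pattern="${3:-*_unrelaxed_rank_001_*.pdb}"
    local expected finished
    expected=$(count_fasta_records "$1")
    finished=$(count_matching_files "$2" "${pattern}")
    echo "${finished}/${expected} sequences have a top-ranked structure"
}
```

Calling `report_progress "${input_fasta}" "${output_path}"` with the variables from the template script prints a single summary line.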
Working with PDB Structures in Pandas
Pandas is one of my favourite data analysis tools for working in Python! Its data frames bring a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.
The stuff MDAnalysis didn’t implement: CPU Parallel HOLE conductance analysis
Some time ago, I needed to find a way to computationally estimate conductance values for every protein frame from several molecular dynamics (MD) trajectories.
In a previous post, I wrote about how to clean the resulting instant conductance timeseries of outliers, but I never described how I generated these timeseries.
In this post, I will show how you can parallelise the computation of instant conductance given an MD trajectory. I will touch on the difficulties of this process and on why I had to implement a custom tool for it, given that MDAnalysis seems to have already implemented a routine of this sort. Finally, I will provide two Python scripts that you can easily adapt to run your parallel calculations, along with some important notes you don’t wanna skip.
Thera-SAbDab Updates (2023)
This blogpost is a short notice about recent quality-of-life and feature updates to our Therapeutic Structural Antibody Database (Thera-SAbDab). We hope these changes will make the database more user-friendly and facilitate new analyses…
Current strategies to predict structures of multiple protein conformational states
Since the release of AlphaFold2 (AF2), the problem of protein structure prediction is widely believed to be solved. Current structure prediction tools, such as AF2, are able to model most proteins with high accuracy. These methods, however, have a major limitation as they have been trained to predict a single structure for a given protein. Proteins are highly dynamic molecules, and their function often depends on transitions between several conformational states. Despite research focusing on the task of predicting the structures of multiple conformations of a protein, currently, no accurate and reliable method is available. In this blog post, I will provide a short overview of the strategies developed for predicting protein conformations. I have grouped these into three sets of related approaches. To conclude, I will also demonstrate how to run one of these strategies on your own.
What can you do with the OPIG Immunoinformatics Suite? v3.0
OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).
9th Joint Sheffield Conference on Cheminformatics
Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:
- De Novo Design
- Open Science
- Chemical Space
- Physics-based Modelling
- Machine Learning
- Property Prediction
- Virtual Screening
- Case Studies
- Molecular Representations
It has traditionally taken place every three years and, despite the global pandemic, it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format.
Checking your PDB file for clashing atoms
Detecting atom clashes in protein structures can be useful in a number of scenarios, for example if you are just about to start a molecular dynamics simulation, or if you want to check that a structure generated by a deep learning model is reasonable. It is quite straightforward to code, but I get the feeling that these sorts of functions have been written from scratch hundreds of times. So to save you the effort, here is my implementation!!!
Cross-linking mass-spectrometry: a guide to conformational confusions.
In the age of highly accurate structure prediction methods, I have seen more and more usage of cross-linking mass-spectrometry (XL-MS) and I wanted to understand its limitations more carefully. This is more of a guide to interpreting the data rather than how to perform the experiment.