Category Archives: Protein Structure

Conference Summary: MGMS Adaptive Immune Receptors Meeting 2024

On 5th April 2024, over 60 researchers braved the train strikes and gusty weather to gather at Lady Margaret Hall in Oxford and engage in a day full of scientific talks, posters and discussions on the topic of adaptive immune receptor (AIR) analysis!

Continue reading

Dockerized ColabFold for large-scale batch predictions

AlphaFold is great; however, it’s not suited to large batch predictions for 2 main reasons. Firstly, there is no native functionality for predicting structures from multiple FASTA sequences (although a custom batch prediction script can be written pretty easily). Secondly, the multiple sequence alignment (MSA) step is heavy, and running MSAs for, say, 10,000 sequences at a tractable speed requires some serious hardware.
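As a rough illustration of the first point, the core of such a batch script is just splitting a multi-sequence FASTA into one file per sequence, which can then be fed to a single-sequence predictor in a loop. This is a minimal sketch, not anything AlphaFold ships: `split_fasta` and the `seq_<n>.fasta` naming are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: write each record of a multi-sequence FASTA to its own
# file, OUT_DIR/seq_<n>.fasta, numbered in input order.
split_fasta() {
    local fasta="$1" out_dir="$2"
    mkdir -p "$out_dir"
    awk -v dir="$out_dir" '
        # A header line starts a new record: close the previous file, open the next
        /^>/ { if (out) close(out); n++; out = dir "/seq_" n ".fasta" }
        # Copy header and sequence lines into the current record file
        out  { print > out }
    ' "$fasta"
}

# Example usage (paths are placeholders):
#   split_fasta path/to/fasta/file.fasta path/to/split
#   for f in path/to/split/*.fasta; do echo "would predict $f"; done
```

Lines before the first header are ignored, and multi-line sequences stay intact within their record.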

Fortunately, an alternative to AlphaFold has been released and is now widely used: ColabFold. For many, ColabFold’s primary strength is that it is cloud-based: prediction requests can be submitted on Google Colab, which makes it extremely user-friendly by avoiding local installations. However, I would argue the greatest value ColabFold brings is a massive MSA speed-up (40-60 fold) from replacing AlphaFold’s HHblits and jackhmmer searches with MMseqs2. This, and the fact that batches of sequences can be processed natively, makes predicting thousands of structures a realistic option (this could still take days on a pair of V100s depending on sequence length etc., but it’s workable).

In my opinion the cleanest local installation and simplest usage of ColabFold is via Docker containers, for which both a Dockerfile and a pre-built Docker image have been released. By default the MSAs are run on the ColabFold public server, which is a shared resource and can only process a total of a few thousand MSAs per day, so large batches need a local sequence database. Unfortunately, the Docker image does not come packaged with the setup_databases.sh script that is required to build one.

The following accordingly outlines the preparatory steps for 100% local batch predictions (setting up the database can in theory be done in 1 line via a mount, but I was getting a weird wget permissions error, so I have broken it up to first fetch the file on the host):

Pull the relevant colabfold docker image (container registry):

docker pull ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2

Create a cache to store weights:

mkdir cache

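Download the model weights (replace path/to/cache with the absolute path to the cache directory, as Docker bind mounts need absolute host paths):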

docker run -ti --rm -v path/to/cache:/cache ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 python -m colabfold.download

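Fetch the setup_databases.sh script (using the raw file URL, since the GitHub blob URL returns an HTML page) and make it executable:

wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
chmod +x setup_databases.sh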
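It is worth confirming that what was downloaded is the script itself and not an HTML page, which is what a GitHub page URL can return. A minimal sketch, assuming the script starts with a shebang line; `is_shell_script` is a hypothetical helper, not part of ColabFold:

```shell
#!/bin/bash

# Hypothetical helper: succeeds only if the first line of the file is a shebang.
# ASSUMPTION: setup_databases.sh begins with "#!".
is_shell_script() {
    local first_line
    first_line=$(head -n 1 "$1" 2>/dev/null)
    case "$first_line" in
        '#!'*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Only check if the file is present in the current directory
if [ -f setup_databases.sh ]; then
    if is_shell_script setup_databases.sh; then
        echo "setup_databases.sh looks like a script"
    else
        echo "setup_databases.sh does not start with a shebang; check the download URL" >&2
    fi
fi
```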

Spin up a container. A container exits as soon as its main command finishes, so we need to be a bit hacky and keep it alive with a command that never returns:

CONTAINER_ID=$(docker run -d ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 /bin/bash -c "tail -f /dev/null")

Copy the setup_databases.sh script to the relevant path in the container and create a databases directory:

docker cp ./setup_databases.sh $CONTAINER_ID:/usr/local/envs/colabfold/bin/ 
docker exec $CONTAINER_ID mkdir /databases

Run the setup script. This will download and prepare the databases (~2TB once extracted):

docker exec $CONTAINER_ID /usr/local/envs/colabfold/bin/setup_databases.sh /databases/ 
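Because the databases are first written inside the container and only copied to the host afterwards, space is needed in both places. Below is a minimal sketch of a pre-flight check to run before the setup step. It assumes Docker’s data root is the default `/var/lib/docker` and that the databases will be copied into the current directory; `has_free_space` is a hypothetical helper.

```shell
#!/bin/bash

# Hypothetical helper: succeeds if the filesystem holding DIR has at least
# REQUIRED_GB gigabytes available (1 GB counted as 1024*1024 KB).
has_free_space() {
    local dir="$1" required_gb="$2" avail_kb
    # df -Pk prints one POSIX-format line per filesystem; column 4 is available KB
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Roughly 2TB, taken from the estimate in the text; adjust as needed.
required_gb=2000

# ASSUMPTION: "." is where the databases will be copied, and /var/lib/docker
# is Docker's data root (the default on Linux).
for dir in . /var/lib/docker; do
    [ -d "$dir" ] || continue
    if has_free_space "$dir" "$required_gb"; then
        echo "$dir: enough space"
    else
        echo "$dir: less than ${required_gb}GB free" >&2
    fi
done
```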

Copy the databases back to the host and clean up:

docker cp $CONTAINER_ID:/databases ./ 
docker stop $CONTAINER_ID
docker rm $CONTAINER_ID

You should now be at a stage where batch predictions can be run, for which I have provided a template script (uses a fasta file with multiple sequences) below. It’s worth noting that maximum search speeds can be achieved by loading the database into memory and pre-indexing, but this requires about 1TB of RAM, which I don’t have.

There are 2 key processes that I prefer to log separately, colabfold_search and colabfold_batch:

#!/bin/bash

# Define the paths for database, input FASTA, and outputs
# (use absolute paths: Docker bind mounts require them)

db_path="path/to/database"
input_fasta="path/to/fasta/file.fasta"
output_path="path/to/output/directory"
log_path="path/to/logs/directory"
cache_path="path/to/weights/cache"

# Run Docker container to execute colabfold_search and colabfold_batch 

time docker run --gpus all \
  -v "${db_path}:/database" \
  -v "${input_fasta}:/input.fasta" \
  -v "${output_path}:/predictions" \
  -v "${log_path}:/logs" \
  -v "${cache_path}:/cache" \
  ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 /bin/bash -c "colabfold_search --mmseqs /usr/local/envs/colabfold/bin/mmseqs /input.fasta /database msas > /logs/search.log 2>&1 && colabfold_batch msas /predictions > /logs/batch.log 2>&1"
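Since a run like this can take days, a rough progress check is handy. This is a minimal sketch with hypothetical helper names. It assumes colabfold_batch leaves one `<name>.done.txt` marker per finished prediction in the output directory, which is worth verifying against your own output.

```shell
#!/bin/bash

# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: count finished jobs.
# ASSUMPTION: one "<name>.done.txt" marker file per completed prediction.
count_done() {
    find "$1" -maxdepth 1 -name '*.done.txt' | wc -l | tr -d ' '
}

# Placeholder paths, matching the template script
input_fasta="path/to/fasta/file.fasta"
output_path="path/to/output/directory"

# Only report if both the input file and the output directory exist
if [ -f "$input_fasta" ] && [ -d "$output_path" ]; then
    echo "$(count_done "$output_path") of $(count_fasta_records "$input_fasta") predictions finished"
fi
```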

Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools when working in Python! The data frames offer a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.

Continue reading

The stuff MDAnalysis didn’t implement: CPU Parallel HOLE conductance analysis

Some time ago, I needed to find a way to computationally estimate conductance values for every protein frame from several molecular dynamics (MD) trajectories.

In a previous post, I wrote about how to clean outliers from the resulting instant conductance timeseries. But I never described how I generated these timeseries.

In this post, I will show how you can parallelise the computation of instant conductance given an MD trajectory. I will touch on the difficulties of this process, and on why I had to implement a custom tool for it given that MDAnalysis seems to have already implemented a routine of this sort. Finally, I will provide two Python scripts that you can easily adapt to run your parallel calculations, along with some important notes you don’t wanna skip.

Violin plots of conductance distributions from 64 molecular dynamics trajectories with 1000 frames each.
Continue reading

Current strategies to predict structures of multiple protein conformational states

Since the release of AlphaFold2 (AF2), the problem of protein structure prediction is widely believed to be solved. Current structure prediction tools, such as AF2, are able to model most proteins with high accuracy. These methods, however, have a major limitation as they have been trained to predict a single structure for a given protein. Proteins are highly dynamic molecules, and their function often depends on transitions between several conformational states. Despite research focusing on the task of predicting the structures of multiple conformations of a protein, currently, no accurate and reliable method is available. In this blog post, I will provide a short overview of the strategies developed for predicting protein conformations. I have grouped these into three sets of related approaches. To conclude, I will also demonstrate how to run one of these strategies on your own.

Continue reading

What can you do with the OPIG Immunoinformatics Suite? v3.0

OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).

Continue reading

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modelling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading

Checking your PDB file for clashing atoms

Detecting atom clashes in protein structures can be useful in a number of scenarios, for example if you are just about to start a molecular dynamics simulation, or if you want to check that a structure generated by a deep learning model is reasonable. It is quite straightforward to code, but I get the feeling that this sort of function has been written from scratch hundreds of times. So to save you the effort, here is my implementation!

Continue reading