Category Archives: How To

How to do things. doh.

How to write a review paper as a first year PhD student

As a first year PhD student, it is not uncommon to be asked to write a review paper on your subject area. It is a great way both to get acquainted with your research field and to get the background portion of your thesis completed early. However, it can seem like a daunting task to go from knowing almost nothing about your research field to producing something of interest for experts who have spent years studying your subject matter.

In my first year, I was exactly in this position and I found very little online to help guide this process. Thus, here is my reflective look at writing a review paper that will hopefully help someone else in the future.

Dockerized Colabfold for large-scale batch predictions

Alphafold is great, but it’s not suited to large batch predictions for two main reasons. Firstly, there is no native functionality for predicting structures from multiple FASTA sequences (although a custom batch prediction script can be written pretty easily). Secondly, the multiple sequence alignment (MSA) step is heavy, and running MSAs for, say, 10,000 sequences at a tractable speed requires some serious hardware.

Fortunately, an alternative to Alphafold has been released and is now widely used: Colabfold. For many, Colabfold’s primary strength is that it is cloud-based: prediction requests can be submitted on Google Colab, which makes it extremely user-friendly by avoiding local installations. However, I would argue the greatest value Colabfold brings is a massive MSA speed-up (40-60 fold) from replacing HHblits and jackhmmer with MMseqs2. This, together with the fact that batches of sequences can be processed natively, makes predicting thousands of structures a realistic option (this could still take days on a pair of V100s depending on sequence length etc., but it’s workable).

In my opinion the cleanest local installation and simplest usage of Colabfold is via Docker containers, for which both a Dockerfile and pre-built docker image have been released. Unfortunately, the Docker image does not come packaged with the necessary setup_databases.sh script, which is required to build a local sequence database. By default the MSAs are run on the Colabfold public server, which is a shared resource and can only process a total of a few thousand MSAs per day.

The following outlines the preparatory steps for 100% local batch predictions (setting up the database can in theory be done in one line via a mount, but I was getting a weird wget permissions error, so I have broken it up to first fetch the file on the local machine):

Pull the relevant colabfold docker image (container registry):

docker pull ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2

Create a cache to store weights:

mkdir cache

Download the model weights:

docker run -ti --rm -v path/to/cache:/cache ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 python -m colabfold.download

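Fetch the setup_databases.sh script (from the raw file URL, since the GitHub blob URL returns an HTML page) and make it executable:

wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
chmod +x setup_databases.sh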
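When fetching files from GitHub with wget, it is easy to end up with an HTML page rather than the file itself. A minimal sketch of a sanity check follows, assuming the downloaded script begins with a shebang line; the helper name looks_like_shell_script is hypothetical and not part of Colabfold.

```shell
# Hypothetical helper: succeed only if the first line of the file is a
# shebang ("#!..."). An HTML page saved by mistake would instead start with
# something like "<!DOCTYPE html>".
looks_like_shell_script() {
    first_line=$(head -n 1 "$1") || return 1
    case "$first_line" in
        '#!'*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example usage after the download:
# looks_like_shell_script ./setup_databases.sh || echo "Download is not a shell script" >&2
```

Running the check before copying the file into the container avoids a confusing failure later, when the setup script is executed.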

Spin up a container. A container exits as soon as its main command finishes, so we need to be a bit hacky and keep it alive with a command that never returns:

CONTAINER_ID=$(docker run -d ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 /bin/bash -c "tail -f /dev/null")

Copy the setup_databases.sh script to the relevant path in the container and create a databases directory:

docker cp ./setup_databases.sh $CONTAINER_ID:/usr/local/envs/colabfold/bin/ 
docker exec $CONTAINER_ID mkdir /databases

Run the setup script. This will download and prepare the databases (~2TB once extracted):

docker exec $CONTAINER_ID /usr/local/envs/colabfold/bin/setup_databases.sh /databases/ 

Copy the databases back to the host and clean up:

docker cp $CONTAINER_ID:/databases ./ 
docker stop $CONTAINER_ID
docker rm $CONTAINER_ID

You should now be at a stage where batch predictions can be run, for which I have provided a template script (uses a fasta file with multiple sequences) below. It’s worth noting that maximum search speeds can be achieved by loading the database into memory and pre-indexing, but this requires about 1TB of RAM, which I don’t have.

There are 2 key processes that I prefer to log separately, colabfold_search and colabfold_batch:

#!/bin/bash

# Define the paths for database, input FASTA, and outputs

db_path="path/to/database"
input_fasta="path/to/fasta/file.fasta"
output_path="path/to/output/directory"
log_path="path/to/logs/directory"
cache_path="path/to/weights/cache"

# Run Docker container to execute colabfold_search and colabfold_batch 

time docker run --gpus all \
  -v "${db_path}:/database" \
  -v "${input_fasta}:/input.fasta" \
  -v "${output_path}:/predictions" \
  -v "${log_path}:/logs" \
  -v "${cache_path}:/cache" \
  ghcr.io/sokrypton/colabfold:1.5.5-cuda12.2.2 \
  /bin/bash -c "colabfold_search --mmseqs /usr/local/envs/colabfold/bin/mmseqs /input.fasta /database msas > /logs/search.log 2>&1 && colabfold_batch msas /predictions > /logs/batch.log 2>&1"
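The host side of each -v mount is most portable as an absolute path, since a source that does not start with / may be interpreted by Docker as a volume name rather than a directory. A minimal sketch of a helper that could be applied to the path variables in the template; the name to_absolute_path is hypothetical.

```shell
# Hypothetical helper: print an absolute version of a path.
# Paths that already start with "/" are printed unchanged; anything else
# is resolved against the current working directory.
to_absolute_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s\n' "$PWD/$1" ;;
    esac
}

# Example usage with the variables from the template script
# ("databases" and "cache" are the directories created in the steps above):
# db_path=$(to_absolute_path "databases")
# cache_path=$(to_absolute_path "cache")
```

The helper does not check that the path exists; it only guards against accidentally passing a relative path to the mount.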

Open Source PyMOL installation on Windows

A year ago, I used Gheorghe Rotaru’s helpful blog post to install PyMOL. Unfortunately, after resetting my computer, I have just discovered that some of the links are broken. Here are the installation steps with new links provided by Christoph Gohlke, who generously offers pre-compiled Windows versions of the latest PyMOL software along with all its requirements.

Install the latest version of Python 3 for Windows:
Download the Windows Installer (x-bit) for Python 3 from their website, with x being your Windows architecture – 32 or 64.

Follow the instructions provided on how to install Python. You can confirm the installation by running ‘py’ in PowerShell.

Fail fast

While scrolling through my Instagram reels feed, I came across a reel of Jensen Huang, NVIDIA’s CEO, talking about the need to fail fast, which motivated me to write a post. ‘Fail fast’ is a piece of advice I have been hearing since I embarked on my PhD: fail fast on the research directions we plan to pursue, so that we understand the difficulties and limitations of the research problems and methods used, which in turn gives us more time to fine-tune our problem and develop more nuanced approaches. Since childhood, most of us have been taught that failures eventually lead to success and that persevering towards success is critical. However, one thing that I could not come to terms with is the narrative of several failures ‘magically’ leading to success. If you were destined to be successful, why would you even fail? And also, for every failure-to-success story we hear, there are many other stories of failure that we don’t.

In defence of chaos

I commend you on your skepticism, but even the skeptical mind must be prepared to accept the unacceptable when there is no alternative. If it looks like a duck, and quacks like a duck, we have at least to consider the possibility that we have a small aquatic bird of the family Anatidæ on our hands.

Douglas Adams

It’s not every day that someone recommends a new whizzbang note-taking software. It’s every second day, or third if you’re lucky. They all have their bells and whistles: Obsidian turns your notes into a funky graph that pulses with information, the web of complexity of your stored knowledge entrapping your attention as you dazzle in its splendour while also the little circles jostle and bounce in decadent harmony. Notion’s aesthetic simplicity belies its comprehensive capabilities, from writing your notes so you don’t need to, to exporting to the web so that the rest of us can read what you didn’t write because you didn’t need to. To pronounce Microsoft OneNote requires only five syllables, efficiently cramming in two extra words while only being one bit slower to say than the mysterious rock competitor. Apple notes can be shared with all the other Apple people who live their happy Apple lives in happy Apple land – and sometimes this even works!

Working with PDB Structures in Pandas

Pandas is one of my favourite data analysis tools for working in Python! Its data frames bring a lot of power and organization to any data analysis task. Here at OPIG we work with a lot of protein structure data coming from PDB files. In the following article I will go through an example of how I use pandas data frames to analyze PDB data.

250 Trips on the Oxford Tube: what I’ve learnt

The Oxford Tube is a bus service that shuttles people between Oxford and London taking approximately 1 hour and 30 minutes. I have now taken the bus over 250 times which is approximately 375 hours or a fortnight of my life.

In this time spent on the bus, I have discovered some tips and tricks that make the journey ever so slightly more bearable. I shall share them so that others can optimise their experience. Enjoy!

Optimising Transformer Training

Training a large transformer model can be a multi-day, if not multi-week, ordeal. Especially if you’re using cloud compute, this can be a very expensive affair, not to mention the environmental impact. It’s therefore worth spending a couple days trying to optimise your training efficiency before embarking on a large scale training run. Here, I’ll run through three strategies you can take which (hopefully) shouldn’t degrade performance, while giving you some free speed. These strategies will also work for any other models using linear layers.

I won’t go into too much of the technical detail of any of the techniques, but if you’d like to dig into any of them further I’d highly recommend the Nvidia Deep Learning Performance Guide.

Training With Mixed Precision

Training with mixed precision can be as simple as adding a few lines of code, depending on your deep learning framework. It also potentially provides the biggest speed-up of any of these techniques. Training throughput can be increased by up to three-fold with little degradation in model performance – and who doesn’t like free speed?

Tip and Tricks to correct a Cuda Toolkit installation in Conda

On the western side of Oxfordshire are the Cotswolds, a pleasant hill range with a curious etymology: the hills of the goddess Cuda (maybe, see footnote). Cuda is a powerful yet wrathful goddess, and getting on her good side does feel like druidry. The first druidic test is getting software to work: the wild magic makes the rules of this test change continually. Therefore, I am writing a summary of what works as of late 2023.

How to get more information from slurm?

So the servers you use have Slurm as their job scheduler? Blopig has very good resources on how to navigate a Slurm environment.

If you are new to SLURMing, I highly recommend Alissa Hummer’s post. There, she explains in detail what you need to submit, check or cancel a job, and even how to run a job with more than one script in parallel by dividing it into tasks. She is so good that by reading her post you will also learn how to move files across the servers, create and manage SSH keys, and set up Miniconda and GitHub on a Slurm server.

And Blopig has even more to offer, with Maranga Mokaya’s and Oliver Turnbull’s posts as nice complements for more advanced use of Slurm. They help with the use of array jobs, more efficient file copying and creating aliases (shortcuts) for frequently used commands.

So… What could I possibly have to add to that?

Well, suppose you are concerned that you or one of your mates might flood the server (not that it has ever happened to me, but just in case).

How would you go about figuring out how many cores are active? How much memory is left? Which GPU does that server use? Fear not, as I have some basic tricks that might help you.

Get information about the servers and nodes:

A pretty straightforward way of getting some information about Slurm servers is the command:

sinfo -M ALL

This gives you each partition’s name, whether it is available, how many nodes it has, their usage state and a list of those nodes.

CLUSTER: name_of_cluster
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
low         up  7-00:00.0    1  idle  node_name.server.address 

The -M ALL argument is used to show every cluster. If you know the name of the cluster you can use:

sinfo -M name_of_cluster

But what if you want to know not only whether it is up and being used, but also how much of its resources are free? Fear not, much is there to learn.

You can use the same sinfo command followed by some arguments that will give you what you want. And the magic command is:

sinfo -o "%all" -M all

This will show you a lot of information about every partition of every cluster:

CLUSTER: name_of_cluster
AVAIL|ACTIVE_FEATURES|CPUS|TMP_DISK|FREE_MEM|AVAIL_FEATURES|GROUPS|OVERSUBSCRIBE|TIMELIMIT|MEMORY|HOSTNAMES|NODE_ADDR|PRIO_TIER|ROOT|JOB_SIZE|STATE|USER|VERSION|WEIGHT|S:C:T|NODES(A/I) |MAX_CPUS_PER_NODE |CPUS(A/I/O/T) |NODES |REASON |NODES(A/I/O/T) |GRES |TIMESTAMP |PRIO_JOB_FACTOR |DEFAULTTIME |PREEMPT_MODE |NODELIST |CPU_LOAD |PARTITION |PARTITION |ALLOCNODES |STATE |USER |CLUSTER |SOCKETS |CORES |THREADS 

Which is a lot.

So, how can you make it more digestible and filter only the info that you want?

Always start with:

sinfo -M ALL -o "%n" 

Inside the quotation marks, add the fields you would like to see. The %n argument shows the hostname of every node in each cluster. If you want to know how much free memory there is on each node, you can use:

sinfo -M ALL -o "%n %e"
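The free-memory column can be filtered further once it is on the screen. The following is a minimal sketch, assuming the two-column output of the command above; the helper name nodes_with_free_mem and the 64000 MB threshold are hypothetical choices for illustration.

```shell
# Hypothetical helper: read `sinfo -o "%n %e"` output on stdin and print the
# nodes whose free memory (in MB) is at least the given threshold.
# Header lines ("CLUSTER: ...", "HOSTNAMES FREE_MEM") and "N/A" values are
# skipped because their second field is not a plain number.
nodes_with_free_mem() {
    awk -v min_mb="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 >= min_mb + 0 { print $1, $2 }'
}

# Example usage on a live cluster (64000 MB is an arbitrary threshold):
# sinfo -M ALL -o "%n %e" | nodes_with_free_mem 64000
```

Because the helper only reads standard input, it can also be tried on a saved copy of the sinfo output.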

If you would also like to know how the CPUs are being used (how many are allocated, idle, other and total), you should use:

sinfo -M ALL -o "%n %e %C"
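The %C column packs four numbers into a single field in the form allocated/idle/other/total. A minimal sketch of how that field could be split up, assuming the three-column output of the command above; the helper name idle_cpus_per_node is hypothetical.

```shell
# Hypothetical helper: read `sinfo -o "%n %e %C"` output on stdin and print,
# for each node, the number of idle CPUs and the total number of CPUs.
# Only lines whose third field looks like "A/I/O/T" (four numbers separated
# by slashes) are processed, which skips the header lines.
idle_cpus_per_node() {
    awk '$3 ~ /^[0-9]+\/[0-9]+\/[0-9]+\/[0-9]+$/ {
        split($3, cpu, "/")
        printf "%s idle=%d total=%d\n", $1, cpu[2], cpu[4]
    }'
}

# Example usage on a live cluster:
# sinfo -M ALL -o "%n %e %C" | idle_cpus_per_node
```

A node reporting idle=0 is fully allocated, so it is a poor target for a job that needs to start quickly.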

Well, I could give more and more examples, but it is more efficient to just leave the table of possible arguments here. They come from the Slurm documentation.

Argument  What does it do?
%all      Print all fields available for this data type with a vertical bar separating each field.
%a        State/availability of a partition.
%A        Number of nodes by state in the format “allocated/idle”. Do not use this with a node state option (“%t” or “%T”) or the different node states will be placed on separate lines.
%b        Features currently active on the nodes, also see %f.
%B        The max number of CPUs per node available to jobs in the partition.
%c        Number of CPUs per node.
%C        Number of CPUs by state in the format “allocated/idle/other/total”. Do not use this with a node state option (“%t” or “%T”) or the different node states will be placed on separate lines.
%d        Size of temporary disk space per node in megabytes.
%D        Number of nodes.
%e        The total memory, in MB, currently free on the node as reported by the OS. This value is for informational use only and is not used for scheduling.
%E        The reason a node is unavailable (down, drained, or draining states).
%f        Features available on the nodes, also see %b.
%F        Number of nodes by state in the format “allocated/idle/other/total”. Note the use of this format option with a node state format option (“%t” or “%T”) will result in the different node states being reported on separate lines.
%g        Groups which may use the nodes.
%G        Generic resources (gres) associated with the nodes (e.g. the graphics cards that the node uses).
%h        Print the OverSubscribe setting for the partition.
%H        Print the timestamp of the reason a node is unavailable.
%i        If a node is in an advanced reservation print the name of that reservation.
%I        Partition job priority weighting factor.
%l        Maximum time for any job in the format “days-hours:minutes:seconds”.
%L        Default time for any job in the format “days-hours:minutes:seconds”.
%m        Size of memory per node in megabytes.
%M        PreemptionMode.
%n        List of node hostnames.
%N        List of node names.
%o        List of node communication addresses.
%O        CPU load of a node as reported by the OS.
%p        Partition scheduling tier priority.
%P        Partition name followed by “*” for the default partition, also see %R.
%r        Only user root may initiate jobs, “yes” or “no”.
%R        Partition name, also see %P.
%s        Maximum job size in nodes.
%S        Allowed allocating nodes.
%t        State of nodes, compact form.
%T        State of nodes, extended form.
%u        Print the user name of who set the reason a node is unavailable.
%U        Print the user name and uid of who set the reason a node is unavailable.
%v        Print the version of the running slurmd daemon.
%V        Print the cluster name if running in a federation.
%w        Scheduling weight of the nodes.
%X        Number of sockets per node.
%Y        Number of cores per socket.
%z        Extended processor information: number of sockets, cores, threads (S:C:T) per node.
%Z        Number of threads per core.
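
As for the question of which GPU a server uses, the %G field from the table is the one to look at. A minimal sketch, assuming the GPUs are registered as a generic resource whose name starts with gpu; GRES strings are configured per site, so the exact format may differ on your cluster, and the helper name gpu_nodes is hypothetical.

```shell
# Hypothetical helper: read `sinfo -o "%n %G"` output on stdin and print only
# the nodes whose GRES field contains an entry starting with "gpu".
# Nodes without generic resources (typically shown as "(null)") and header
# lines are skipped.
gpu_nodes() {
    awk '$2 ~ /(^|,)gpu/ { print $1, $2 }'
}

# Example usage on a live cluster:
# sinfo -M ALL -o "%n %G" | gpu_nodes
```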

And there you have it! Now you can see what is going on in your Slurm clusters and avoid job-blocking your peers.

If you want to know more about slurm, keep an eye on Blopig!