{"id":11159,"date":"2024-04-04T16:41:24","date_gmt":"2024-04-04T15:41:24","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=11159"},"modified":"2024-04-04T16:41:26","modified_gmt":"2024-04-04T15:41:26","slug":"dockerized-colabfold-for-large-scale-batch-predictions","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2024\/04\/dockerized-colabfold-for-large-scale-batch-predictions\/","title":{"rendered":"Dockerized Colabfold for large-scale batch predictions"},"content":{"rendered":"\n<p>AlphaFold is great; however, it is not well suited to large batch predictions, for two main reasons. Firstly, there is no native functionality for predicting structures from multiple FASTA sequences (although a custom batch-prediction script can be written fairly easily). Secondly, the multiple sequence alignment (MSA) step is computationally heavy, and running MSAs for, say, 10,000 sequences at a tractable speed requires some serious hardware.<\/p>\n\n\n\n<p>Fortunately, an alternative to AlphaFold has been released and is now widely used: ColabFold. For many, ColabFold\u2019s primary strength is that it is cloud-based: prediction requests can be submitted via Google Colab, which avoids local installation entirely and makes it extremely user-friendly. However, I would argue that the greatest value ColabFold brings is a massive MSA speed-up (40-60-fold), achieved by replacing HHblits and BLAST with MMseqs2. This, together with native processing of batches of sequences, makes predicting thousands of structures a realistic option (it could still take days on a pair of V100s, depending on sequence length etc., but it\u2019s workable).<\/p>\n\n\n\n<p>In my opinion, the cleanest local installation and simplest usage of ColabFold is via Docker containers, for which both a <a href=\"https:\/\/github.com\/sokrypton\/ColabFold\/tree\/main\">Dockerfile<\/a> and a pre-built Docker <a href=\"https:\/\/github.com\/sokrypton\/ColabFold\/wiki\/Running-ColabFold-in-Docker\">image<\/a> have been released. 
Unfortunately, the Docker image does not come packaged with the setup_databases.sh script, which is required to build a local sequence database. By default, MSAs are run on the ColabFold public server, a shared resource that can only process a total of a few thousand MSAs per day. <\/p>\n\n\n\n<p>The following therefore outlines the preparatory steps for 100% local batch predictions. (Setting up the database can in theory be done in one line via a mount, but I was getting an odd wget permissions error, so I have broken it up to first fetch the script on the host.)<\/p>\n\n\n\n<p>Pull the relevant ColabFold Docker image (<a href=\"https:\/\/github.com\/sokrypton\/ColabFold\/pkgs\/container\/colabfold\">container registry<\/a>):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker pull ghcr.io\/sokrypton\/colabfold:1.5.5-cuda12.2.2<\/code><\/pre>\n\n\n\n<p>Create a cache directory to store the model weights:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>mkdir cache<\/code><\/pre>\n\n\n\n<p>Download the model weights:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker run -ti --rm -v path\/to\/cache:\/cache ghcr.io\/sokrypton\/colabfold:1.5.5-cuda12.2.2 python -m colabfold.download\r\n<\/code><\/pre>\n\n\n\n<p>Fetch the setup_databases.sh script (note the raw.githubusercontent.com URL; wget on the github.com\/blob page would download the HTML page rather than the script):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>wget https:\/\/raw.githubusercontent.com\/sokrypton\/ColabFold\/main\/setup_databases.sh \r\n<\/code><\/pre>\n\n\n\n<p>Spin up a container. 
The container would otherwise exit as soon as its command finishes, so we need to be a bit hacky and keep it alive with a never-ending command in a detached container:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CONTAINER_ID=$(docker run -d ghcr.io\/sokrypton\/colabfold:1.5.5-cuda12.2.2 \/bin\/bash -c \"tail -f \/dev\/null\")<\/code><\/pre>\n\n\n\n<p>Copy the setup_databases.sh script to the relevant path in the container and create a databases directory:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker cp .\/setup_databases.sh $CONTAINER_ID:\/usr\/local\/envs\/colabfold\/bin\/ <\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>docker exec $CONTAINER_ID mkdir \/databases<\/code><\/pre>\n\n\n\n<p>Run the setup script (invoked via bash, since the copied file may not have the executable bit set). This will download and prepare the databases (~2TB once extracted):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker exec $CONTAINER_ID bash \/usr\/local\/envs\/colabfold\/bin\/setup_databases.sh \/databases\/ <\/code><\/pre>\n\n\n\n<p>Copy the databases back to the host and clean up:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>docker cp $CONTAINER_ID:\/databases .\/ <\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>docker stop $CONTAINER_ID\ndocker rm $CONTAINER_ID<\/code><\/pre>\n\n\n\n<p>You should now be able to run batch predictions; a template script (which takes a FASTA file containing multiple sequences) is provided below. It\u2019s worth noting that maximum search speed can be achieved by loading the database into memory and pre-indexing it, but this requires about 1TB of RAM, which I don&#8217;t have. 
<\/p>\n\n\n\n<p>There are two key processes that I prefer to log separately, colabfold_search and colabfold_batch:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/bin\/bash\n\n# Define the paths for the database, input FASTA, outputs, logs, and weights cache\n\ndb_path=\"path\/to\/database\"\ninput_fasta=\"path\/to\/fasta\/file.fasta\"\noutput_path=\"path\/to\/output\/directory\"\nlog_path=\"path\/to\/logs\/directory\"\ncache_path=\"path\/to\/weights\/cache\"\n\n# Run the Docker container to execute colabfold_search and then colabfold_batch\n\ntime docker run --gpus all -v \"${db_path}:\/database\" -v \"${input_fasta}:\/input.fasta\" -v \"${output_path}:\/predictions\" -v \"${log_path}:\/logs\" -v \"${cache_path}:\/cache\" \\\n  ghcr.io\/sokrypton\/colabfold:1.5.5-cuda12.2.2 \/bin\/bash -c \"colabfold_search --mmseqs \/usr\/local\/envs\/colabfold\/bin\/mmseqs \/input.fasta \/database msas &gt; \/logs\/search.log 2&gt;&amp;1 &amp;&amp; colabfold_batch msas \/predictions &gt; \/logs\/batch.log 2&gt;&amp;1\"\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Alphafold is great, however it\u2019s not suited for large batch predictions for 2 main reasons. Firstly, there is no native functionality for predicting structures off multiple fasta sequences (although a custom batch prediction script can be written pretty easily). 
Secondly, the multiple sequence alignment (MSA) step is heavy and running MSAs for, say, 10,000 sequences [&hellip;]<\/p>\n","protected":false},"author":121,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[633,348,632,14,588,228],"tags":[407,762,763],"ppma_author":[756],"class_list":["post-11159","post","type-post","status-publish","format-standard","hentry","category-ai","category-bash","category-deep-learning","category-howto","category-linux-gnu-linux","category-protein-structure","tag-alphafold","tag-colabfold","tag-hpc"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":756,"user_id":121,"is_guest":0,"slug":"dylan","display_name":"Dylan 
Adlard","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/35565c3d1610dc574b979e9ca5f93f89663d1e1ec6457294b982aa49e288e23c?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11159","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/121"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=11159"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11159\/revisions"}],"predecessor-version":[{"id":11165,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11159\/revisions\/11165"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=11159"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=11159"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=11159"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=11159"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}