Category Archives: Code

Pairwise sequence identity and Tanimoto similarity in PDBbind

In this post I will cover how to calculate sequence identity and Tanimoto similarity between any pairs of complexes in PDBbind 2020. I used RDKit in python for Tanimoto similarity and the MMseqs2 software for sequence identity calculations.

A few weeks back I wanted to cluster the protein-ligand complexes in PDBbind 2020, but to achieve this I first needed to precompute the sequence identity between all pairs sequences in PDBbind, and Tanimoto similarity between all pairs of ligands. PDBbind 2020 includes 19.443 complexes but there are much fewer distinct ligands and proteins than that. However, I kept things simple and calculated the similarities for all 19.443*19.443 pairs. Calculating the Tanimoto similarity is relatively easy thanks to the BulkTanimotoSimilarity function in RDKit. The following code should do the trick:

from rdkit.Chem import AllChem, MolFromMol2File
from rdkit.DataStructs import BulkTanimotoSimilarity
import numpy as np
import os

fps = []
for pdb in pdbs:
    mol = MolFromMol2File(os.path.join('data', pdb, f'{pdb}_ligand.mol2'))
    fps.append(AllChem.GetMorganFingerprint(mol, 3))

sims = []
for i in range(len(fps)):
    sims.append(BulkTanimotoSimilarity(fps[i],fps))

arr = np.array(sims)
np.savez_compressed('data/tanimoto_similarity.npz', arr)

Sequence identity calculations in python with Biopandas turned out to be too slow for this amount of data so I used the ultra fast MMseqs2. The first step to running MMseqs2 is to create a .fasta file of all the sequences, which I call QUERY.fasta. This is what the first few lines look like:

Continue reading

Checking your PDB file for clashing atoms

Detecting atom clashes in protein structures can be useful in a number of scenarios. For example if you are just about to start some molecular dynamics simulation, or if you want to check that a structure generated by a deep learning model is reasonable. It is quite straightforward to code, but I get the feeling that these sort of functions have been written from scratch hundreds of times. So to save you the effort, here is my implementation!!!

Continue reading

Streamlining Your Terminal Commands With Custom Bash Functions and Aliases

If you’ve ever found yourself typing out the same long commands over and over again, or if you’ve ever wished you could teleport directly to your favourite directories, then this post is for you.

Before we jump into some useful examples, let’s go over what bash functions and aliases are, and how to set them up.

Bash Functions vs Aliases

A bash function is like a mini script stored in your .bashrc or .bash_profile file. It can accept arguments, execute a series of commands, and even return a value.

Continue reading

Unclear documentation? ChatGPT can help!

The PyMOL Python API is a useful resource for most people doing research in OPIG, whether focussed on antibodies, small molecule drug design or protein folding. However, the documentation is poorly structured and difficult to interpret without first having understood the structure of the module. In particular, the differences between use of the PyMOL command line and the API can be unclear, leading to a much longer debugging process for code than you’d like.

While I’m reluctant to continue the recent theme of ChatGPT-related posts, this is a use for ChatGPT that would have been incredibly useful to me when I was first getting to grips with the PyMOL API.

Continue reading

Coding a Progress Bar for your Google Slides Presentation

Presentations are a great opportunity to explain your work to a new audience and receive valuable feedback. A vital aspect of a presentation is keeping the audience’s attention which is generally quite tricky I have found (from experience).

One thing that I have noticed other presenters using, which has helped maintain my focus, is an indication of the progression of the presentation. Including in your slides information that there are only a few slides remaining, encourages the listeners to keep their focus for a little longer.

Instead I will show you how to do it using Apps Script, Google’s cloud platform that allows you to write JavaScript code which can work with its online products such as Docs or Slides.

Continue reading

BRICS Decomposition in 3D

Inspired by this blog post by the lovely Kate, I’ve been doing some BRICS decomposing of molecules myself. Like the structure-based goblin that I am, though, I’ve been applying it to 3D structures of molecules, rather than using the smiles approach she detailed. I thought it may be helpful to share the code snippets I’ve been using for this: unsurprisingly, it can also be done with RDKit!

I’ll use the same example as in the original blog post, propranolol.

1DY4: CBH1 IN COMPLEX WITH S-PROPRANOLOL

First, I import RDKit and load the ligand in question:

Continue reading

PLIP on PDBbind with Python

Today’s blog post is about using PLIP to extract information about interactions between a protein and ligand in a bound complex, using data from PDBbind. The blog post will cover how to combine the protein pdb file and the ligand mol2 file into a pdb file, and how to use PLIP in a high-throughput manner with python.

In order for PLIP to consider the ligand as one molecule interacting with the protein, we need to modify the mol2 file of the ligand. The 8th column of the atom portion of a mol2 file (the portion starts with @<TRIPOS>ATOM) includes the ID of the ligand that the atom belongs to. Most often all the atoms have the same ligand ID, but for peptides for instance, the atoms have the ID of the residue they’re part of. The following code snippet will make the required changes:

ligand_file = 'data/5oxm/5oxm_ligand.mol2'

with open(ligand_file, 'r') as f:
    ligand_lines = f.readlines()

mod = False
for i in range(len(ligand_lines)):
    line = ligand_lines[i]
    if line == '@&lt;TRIPOS&gt;BOND\n':
        mod = False
        
    if mod:
        ligand_lines[i] = line[:59] + 'ISK     ' + line[67:]
        
    if line == '@&lt;TRIPOS&gt;ATOM\n':
        mod = True

with open('data/5oxm/5oxm_ligand_mod.mol2', 'w') as g:
    for j in ligand_lines:
        g.write(j)
Continue reading

The ultimate modulefile for conda

Environment modules is a great tool for high-performance computing as it is a modular system to quickly and painlessly enable preset configurations of environment variables, for example a user may be provided with modulefile for an antiquated version of a tool and a bleeding-edge alpha version of that same tool and they can easily load whichever they wish. In many clusters the modules are created with a tool called EasyBuild, which delivered an out-of-the-box installation. This works for things like a single binary, but for conda this severely falls short as there are many many configuration changes needed.

Continue reading

BRICS Decomposition and Synthetic Accessibility

Recently I’ve been thinking a lot about how to decompose a compound into smaller fragments specifically for a retrosynthetic purpose. My question is: given a compound, can I return building blocks that are likely to synthesize together to produce this compound simply by breaking likely bonds formed in a reaction? A method that is nearly 15 years old named, breaking of retrosynthetically interesting chemical substructures (BRICS), is one approach to do this. Here I’ll explore how BRICS can reflect synthetic accessibility.

Continue reading

LaTeX Beamer Template with Logos

Alternative Title: The tragic story of how I got trapped making slides with latex.

Typically after giving a presentation at least one person will approach me and ask if they could have access to my custom latex template to make slides with beamer that don’t look rubbish.

TL;DR Yes you can: https://github.com/npqst/latex-beamer-template

Continue reading