Monthly Archives: November 2022

How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors, I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).

ECFPs were originally described in a 2010 article by Rogers and Hahn [1] and remain among the most popular and efficient methods for turning a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP algorithm depends on two predefined hyperparameters: the fingerprint length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.
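To make the idea of an L-dimensional bitvector concrete, here is a toy sketch of the final "folding" step only: each substructure identifier is hashed and mapped to one of L positions, which is then set to 1. This is not RDKit's implementation. The function name, the use of SHA-256 as the hash, and the placeholder substructure labels are all assumptions made for illustration.

```python
import hashlib


def fold_into_bitvector(substructure_ids, length):
    """Map each substructure identifier to one of `length` bit positions."""
    bits = [0] * length
    for identifier in substructure_ids:
        # SHA-256 is a stand-in for whatever hash a real ECFP implementation uses.
        digest = hashlib.sha256(identifier.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % length
        bits[index] = 1  # the bit records presence, not a count
    return bits


# Hypothetical labels for circular substructures of growing radius
# around a single nitrogen atom (radius 0, 1 and 2).
example_substructures = ["N", "N(C)(C)", "N(C(C))(C(=O))"]

fingerprint = fold_into_bitvector(example_substructures, length=16)
print(fingerprint)
```

Because L is usually much smaller than the number of possible substructures, two different substructures can land on the same position, so a set bit does not always identify a single substructure.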

Continue reading

Unreasonably faster notes, with command-line fuzzy search

A good note system should act like a second brain:

  1. Accessible in seconds
  2. Adding information should be frictionless
  3. Searching should be exhaustive – if it’s there, you must find it

The benefits of such a note system are immense – never forget anything again! Search, perform the magic ritual of Copy Paste, and rejoice in the wisdom of your tried and tested past.

But how? Through the unreasonable effectiveness of interactive fuzzy search. This is how I have used Fuz, a terminal-based file fuzzy finder, for about 4 years.

Briefly, Fuz extracts all text within a directory using ripgrep, enables interactive fuzzy search with FZF, and returns you the selected item. As you type, the search results get narrowed down to a few matches. Files are opened at the exact line you found. And it’s FAST – 100,000 lines in half a second fast.

Using Fuz to quickly add a code snippet to our note directory, then retrieving it with fuzzy search. Here, the snippet shows how to read FASTA files with Biopython and is conveniently added to a file called biopython.py.
Continue reading

Naga101: A Guide to Getting Started with (OPIG) Slurm Servers

Over the past months, I’ve been working with a few new members of OPIG, which left me answering (and asking) lots of questions about working with Slurm. In this blog post, I will try to cover the key practical basics of interacting with servers that are set up with Slurm.

Slurm is a workload manager or job scheduler for Linux, meaning that it helps with allocating resources (e.g. CPUs and GPUs) on a server to users’ jobs.

To note, all of the commands and files shown here are run from a so-called ‘head’ node, from which you access Slurm servers.

1. Entering an interactive session

Unlike many other servers, you cannot access a Slurm server via ‘ssh’. Instead, you can enter an interactive (or ‘debug’) session – which, in OPIG, is limited to 30 minutes – via the srun command. This is incredibly useful for copying files, setting up environments and checking that your code runs.

srun -p servername-debug --pty --nodes=1 --ntasks-per-node=1 -t 00:30:00 --wait=0 /bin/bash

2. Submitting jobs

While the srun command is easy and helpful, many of the jobs we want to run on a server will take longer than the debug queue time limit. You can submit a job, which can then run for a longer (although typically still capped) time but is not interactive, via sbatch.

Continue reading

Coarse-grained models of antibody solutions

Various coarse-grained (CG) models have become increasingly common in studies of antibody-antibody interactions in solution. These models appear poised to enter development pipelines in the near future to help predict and understand how antibody-antibody interactions influence the suitability of a given monoclonal antibody (mAb) for mass production and delivery as an antibody therapy. This blog post is a non-exhaustive summary of some of the highlights I found during a recent literature search.

Continue reading

Supercharge Your Literature Review With These Tools

When starting a new project, conducting a literature review of the field can be one of the most daunting prospects. Not only do you need to get through a mountain of research papers, you also need to work out which mountain of papers to get through. You don’t want to start a project only to realise a few weeks (or months!) in that you missed a key paper which would have completely changed the course of your research. Luckily, there are now several handy tools which can help speed up this process.

Continue reading

Thinking of going to a conference

As so many members of the group have never attended an in-person conference, I thought it might be worth answering the question “why do people attend conferences?”

First up, we should remember that flying around the world comes at a cost to the planet, so all of us lucky enough to be able to travel should think hard every time before we choose to do so.

This means it’s really important to make sure that we know why we are going to any conference and maximise the benefits from attendance. Below are a few things to think about in terms of why you attend a conference and what to do when you are there, but this is definitely not a complete list, more a starter for four.

Continue reading

Am I better? Performance metrics unravelled

What’s the deal with all these numbers? Accuracy, Precision, Recall, Sensitivity, AUC and ROCs.

The basic stuff:

Given a method that produces an outcome, either categorical (classification) or continuous (regression), we want to know how well our method did. Let’s start simple:

True positives (TP): You said something was a cow and it was in fact a cow – duh.

False positives (FP): You said it was a cow and it wasn’t – sad.

True negatives (TN): You said it was not a cow and it was not – good job.

False negatives (FN): You said it was not a cow but it was a cow – do better.
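As a minimal sketch, the four counts above can be tallied from paired ground-truth labels and predictions. The function name and the toy cow data below are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for paired boolean labels and predictions."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess and truth:
            tp += 1  # said cow, was a cow
        elif guess and not truth:
            fp += 1  # said cow, was not a cow
        elif not guess and not truth:
            tn += 1  # said not a cow, was not a cow
        else:
            fn += 1  # said not a cow, was a cow
    return tp, fp, tn, fn


# Hypothetical toy data: True means "cow".
is_cow = [True, True, False, False, False, True]
said_cow = [True, False, True, False, False, True]

tp, fp, tn, fn = confusion_counts(is_cow, said_cow)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # prints "TP=2 FP=1 TN=2 FN=1"
```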

I can optimise these metrics artificially. Just call everything a cow and I have a 100% true positive rate. We are usually interested in a trade-off, something like the relative value of metrics. This gives us:

Continue reading