Category Archives: Code

LaTeX Beamer Template with Logos

Alternative Title: The tragic story of how I got trapped making slides with latex.

Typically after giving a presentation at least one person will approach me and ask if they could have access to my custom latex template to make slides with beamer that don’t look rubbish.

TL;DR Yes you can: https://github.com/npqst/latex-beamer-template

Continue reading

Atom mapping with RXNMapper

When recently looking at some reaction data, I was confronted with the problem of atom-to-atom mapping (AAM) and what tools are available to tackle it. AAM refers to the process of mapping individual atoms in reactants to their corresponding atoms in the products, which is important for defining a reaction template and identifying which bonds are being formed and broken. This has many downstream uses for computational chemists, such as for reaction searching and forward and retrosynthesis planning1. The problem is that many reaction databases do not contain these mappings, and annotation by expert chemists is impractical for databases containing thousands (or more) data points.

Continue reading

How to easily use pharmacophoric atom features to turn ECFPs into FCFPs

Today’s post builds on my earlier blogpost on how to turn a SMILES string into an extended-connectivity fingerprint using RDKit and describes an interesting and easily implementable modification of the extended-connectivity fingerprint (ECFP) featurisation. This modification is based on representing the atoms in the input compound at a different (and potentially more useful) level of abstraction.

We remember that each binary component of an ECFP indicates the presence or absence of a particular circular subgraph in the input compound. Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom- and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom- or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types (single, double, triple, or aromatic). To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [1]; but there is also another less commonly used and often overlooked version of the ECFP that uses pharmacophoric atom features instead [2]. Pharmacophoric atom features attempt to describe atomic properties that are critical for biological activity or binding to a target protein. These features try to capture the potential for important chemical interactions such as hydrogen bonding or ionic bonding. ECFPs that use pharmacophoric atom features instead of standard atom features are called functional-connectivity fingerprints (FCFPs). The exact sets of standard- vs. pharmacophoric atom features for ECFPs vs. FCFPs are listed in the table below.

In RDKit, ECFPs can be changed to FCFPs extremely easily by changing a single input argument. Below you can find a Python/RDKit implementation of a function that turns a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False.

# import packages
import numpy as np
from rdkit.Chem import AllChem

# define function that transforms a SMILES string into an FCFP if use_features = True and into an ECFP if use_features = False
def FCFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = True,
                     use_chirality = False):
    """
    Inputs:
    
    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if true then use pharmacophoric atom features, if false then use standard DAYLIGHT atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features
    
    Outputs:
    - np.array(feature_list) ... FCFP/ECFP with length L and maximum radius R
    """
    
    molecule = AllChem.MolFromSmiles(smiles)
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)
    return np.array(feature_list)

The use of pharmacophoric atom features makes FCFPs more specific to molecular interactions that drive biological activity. In certain molecular machine-learning applications, replacing ECFPs with FCFPs can therefore lead to increased performance and decreased learning time, as important high-level atomic properties are presented to the learning algorithm from the start and do not need to be inferred statistically. However, the standard atom features used in ECFPs contain more detailed low-level information that could potentially still be relevant for the prediction task at hand and thus be utilised by the learning algorithm. It is often unclear from the outset whether FCFPs will provide a substantial advantage over ECFPs in a given application; however, given how easy it is to switch between the two, it is almost always worth trying out both options.

[1] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.

[2] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.

Datamining Wikipedia and writing JS with ChatGTP just to swap the colours on university logos…

I am not sure the University of Oxford logo works in the gold from the University of Otago…

A few months back I moved from the Oxford BRC to OPIG, both within the university of Oxford, but like many in academia I have moved across a few universities. As this is my first post here I wanted to do something neat: a JS tool that swapped colours in university logos!
It was a rather laborious task requiring a lot of coding, but once I got it working, I ended up tripping up at the last metre. So for technical reasons, I have resorted to hosting it in my own blog (see post), but nevertheless the path towards it is worth discussing.

Continue reading

Does ChatGPT know how to translate images?

Yesterday I spent a couple of hours playing with ChatGPT. I know, we have some other recent posts about it. It’s so amazing that I couldn’t resist writing another. Apologies for that.  

The goal of this post is to determine if I can effectively use ChatGPT as a programmer/mathematician assistant. OK. It was not my original intention, but let’s pretend it was, just to make this post more interesting.

So, I started asking a few very simple programming answers like the following: 

Can you implement a function to compute the factorial of a number using a cache? Use python.

And this is what I got.

A clear and efficient implementation of the factorial. This is the kind of answer you would expect from a first year CS student.

Continue reading

Cleaning outliers in conductance timeseries from molecular dynamics

Have you ever had an annoying dataset that looks something like this?

or even worse, just several of them

In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective will be to get something more eye-pleasing (and mostly less troublesome for further data analysis) like this

Continue reading

How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).

ECFPs were originally described in a 2010 article of Rogers and Hahn [1] and still belong to the most popular and efficient methods to turn a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP-algorithm is dependent on two predefined hyperparameters: the fingerprint-length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bitvector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.

Continue reading

Unreasonably faster notes, with command-line fuzzy search

A good note system should act like a second brain:

  1. Accessible in seconds
  2. Adding information should be frictionless
  3. Searching should be exhaustive – if it’s there, you must find it

The benefits of such a note system are immense – never forget anything again! Search, perform the magic ritual of Copy Paste, and rejoice in the wisdom of your tried and tested past.

But how? Through the unreasonable effectiveness of interactive fuzzy search. This is how I have used Fuz, a terminal-based file fuzzy finder, for about 4 years.

Briefly, Fuz extracts all text within a directory using ripgrep, enables interactive fuzzy search with FZF, and returns you the selected item. As you type, the search results get narrowed down to a few matches. Files are opened at the exact line you found. And it’s FAST – 100,000 lines in half a second fast.

Using Fuz to quickly add a code-snippet in our note directory – then retrieving it with fuzzy-search. Here, on how to read FASTA files with Biopython, conveniently added to a file called biopython.py.
Continue reading

How to build a Python dictionary of residues for each molecule in PyMOL

Sometimes it can be handy to work with multiple structures in PyMOL using Python.

Here’s a snippet of code you might find useful: we iterate over all the α-carbon atoms in a protein and append to a list tuples such as (‘GLY’, 1). The dictionary, ‘reslist’, returns a list of residue names and indices for each molecule, where the key is a string containing the name of the molecule.

from pymol import cmd

# Create a list of all the objects, called 'mpls':
mols = cmd.get_object_list('*')

# Create an empty dictionary that will return a list of residues
# given the name of the molecule object
reslist = {}

# Set the dictionaries to be empty lists
for m in mols:  reslist[m] = []

# Use PyMOL's iterate command to go over every α-Carbon and append 
# a tuple consisting of the each residue's residue name ('resn') and
# residue index ('resi '):
for m in mols:  cmd.iterate('%s and n. ca'%m, 'reslist["%s"].append((resn,int(resi)))'%m)

This script assumes you only have protein molecules loaded, and ignores things like chain ID and insertion codes.

Once you have your list of residues, you can use it with the cmd.align command, e.g., to align a particular residue to a reference structure.