An A-Z of Oxford

The 2021/2 academic year is now well underway in Oxford, which means a fresh batch of new students getting to grips with some of the bewildering terminology employed here, as well as prospective applicants for next year trying to figure out what on earth a college is and which one they should apply to. As a wizened final-year DPhil student, I decided to compile an A-Z of Oxford-related terms in the hope that someone might find it useful.

A – Ashmolean Museum

Britain’s first public museum, which opened all the way back in 1683. Home to exhibits covering Ancient Egypt to Modern Art and everything in between.

The front of the Ashmolean, right in the middle of Oxford City Centre

B – Battels

A termly bill students receive from their college which might cover things like charges for food and accommodation, or fines for not returning books to the library on time.

C – College

The 39 colleges are small educational institutions which together comprise the University of Oxford. Every student is a member of a college, each of which has its own set of facilities, including a dining hall, bar, library and student accommodation. Colleges also have their own student unions, called the Junior Common Room (for undergraduates) and Middle Common Room (for postgraduates), which are excellent places to socialise and meet people studying lots of different subjects.

An aerial view of many of the university’s colleges

Using normalized SuCOS scores.

If you are working in cheminformatics or utilise protein-ligand docking, then you should be aware of the SuCOS score, an open-source shape and chemical feature overlap metric designed by a former member of OPIG: Susan Leung.

The metric compares the 3D conformers of two ligands based on their shape overlap as well as their chemical feature overlap using the RDKit toolkit. Leung et al. show that SuCOS is able to select fewer false positives and false negatives when doing re-docking studies than other scoring metrics such as RMSD or Protein Ligand Interaction Fingerprints (PLIF) similarity scores and performs better at differentiating actives from decoys when tested on the DUD-E dataset.

Most importantly, SuCOS was designed with fragment-based drug discovery in focus, where a smaller fragment ligand is elaborated or combined with other fragments to create a larger molecule, hopefully with stronger binding affinity. Unlike, for example, RMSD, SuCOS is able to quickly calculate an overlap score between a small fragment and a larger molecule, giving chemists an idea of how the fragment elaboration might interact with the protein. However, the original SuCOS algorithm was not normalized and could produce scores greater than 1 in some cases.

I’ve uploaded a normalised version of the original SuCOS algorithm as a GitHub fork of Susan’s original repository. You can find the normalised SuCOS algorithm here.

Hopefully this is helpful for anyone using the SuCOS algorithm and for all docking enthusiasts who are interested in an alternative way to evaluate their docked poses.

An idea by any other name would smell as sweet.

A blog post about ideas.

Ideation is the formation of an idea, but how do we ideate? 

The root of the word is “to see”, so when we have an idea we see something. In that moment of realization, we hold on to something quite abstract. Some describe it as a click or pattern or insight. This “seeing” is with the mind, however, not the eyes. Idea also implies sentiment or direction – a path one might say. It’s this last point that resonates with me most. When we are lost, in the sea of thoughts, most of the time the consequences are immediate (no consciousness required). However, sometimes we must pause and ideate. Our path, the next step, is unclear.


Monty Python

Every now and then I decide to overthink a problem I thought I understood and get confused – last week, it was the Monty Hall problem. 

For those unfamiliar with the thought experiment, the basic premise is that you are on a game show and are presented with three doors. Behind one of the doors is a car, while behind the other two are goats. 

With zero initial information, you make a guess as to which door you think the car is behind (we assume you have enough goats already). Before looking behind your chosen door, the host opens one of the remaining two doors and reveals a goat. The host then asks you if you would like to change your guess. What should you do? 
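One way to build intuition for the question is to simulate the game many times and compare the two strategies. The sketch below is only an illustration: it assumes the host knows where the car is and always opens a goat door that you did not pick, and the function names (`play_round`, `win_rate`) are made up for this example.

```python
import random


def play_round(rng, switch):
    """Play one game and return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    first_pick = rng.choice(doors)
    # Assumption: the host knows where the car is and always opens a
    # goat door that the contestant did not pick.
    host_opens = rng.choice([d for d in doors if d != first_pick and d != car])
    if switch:
        # Switch to the only door that is neither picked nor opened.
        final_pick = next(d for d in doors if d not in (first_pick, host_opens))
    else:
        final_pick = first_pick
    return final_pick == car


def win_rate(switch, n_rounds=100_000, seed=0):
    """Estimate the probability of winning for a given strategy."""
    rng = random.Random(seed)
    wins = sum(play_round(rng, switch) for _ in range(n_rounds))
    return wins / n_rounds
```

Under these assumptions, `win_rate(switch=True)` should come out close to 2/3 and `win_rate(switch=False)` close to 1/3, since switching wins exactly when the first guess was wrong.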


Getting the PDB structures of compounds in ChEMBL

Recently I was dealing with a set of compounds with known target activities from the ChEMBL database, and I wanted to find out which of them also had PDB crystal structures in complex with that target.

Cross-referencing this manually is very easy when we are interested in 2-3 compounds, but for any larger number, using the ChEMBL and PDB web services greatly reduces the number of clicks.


Issues with graph neural networks: the cracks are where the light shines through

Deep convolutional neural networks have led to astonishing breakthroughs in the area of computer vision in recent years. The reason for the extraordinary performance of convolutional architectures in the image domain is their strong ability to extract informative high-level features from visual data. For prediction tasks on images, this has led to superhuman performance in a variety of applications and to an almost universal shift from classical feature engineering to differentiable feature learning.

Unfortunately, the picture is not quite as rosy yet in the area of molecular machine learning. Feature learning techniques which operate directly on raw molecular graphs without intermediate feature-engineering steps have only emerged in the last few years in the form of graph neural networks (GNNs). GNNs, however, still have not managed to definitively outcompete and replace more classical non-differentiable molecular representation methods such as extended-connectivity fingerprints (ECFPs). There is an increasing awareness in the computational chemistry community that GNNs have not quite lived up to the initial hype and still suffer from a number of technical limitations.


Command-Line Interfaces (CLIs), argparse.ArgumentParser and some of my tricks.

Command-Line Interfaces (CLIs) are one of the best ways of providing your programs with useful parameters to customize their execution. If you are not familiar with CLIs, this blog post will introduce them. Let’s say that you have a program that reads a file, computes something, and then writes the results into another file. The simplest way of providing those arguments would be:

$ python mycode.py my/inputFile my/outputFile
### mycode.py ###
import sys

def doSomething(inputFilename):
    # Count the number of lines in the input file
    with open(inputFilename) as f:
        return len(f.readlines())

if __name__ == "__main__":
    # Notice that the order of the arguments is important
    inputFilename = sys.argv[1]
    outputFilename = sys.argv[2]

    with open(outputFilename, "w") as f:
        # write() expects a string, so convert the line count first
        f.write(str(doSomething(inputFilename)))
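For comparison, here is a minimal sketch of how the same program might look with argparse.ArgumentParser, so that the output file becomes a named option and the order of the arguments no longer matters. The option name `--output`, its default value, and the helpers `build_parser` and `main` are assumptions made for this illustration.

```python
import argparse


def doSomething(inputFilename):
    # Count the number of lines in the input file
    with open(inputFilename) as f:
        return len(f.readlines())


def build_parser():
    parser = argparse.ArgumentParser(
        description="Count the lines of a file and write the count to another file."
    )
    # Positional argument: always required
    parser.add_argument("inputFilename", help="file to read")
    # Optional argument with a default (hypothetical name and default)
    parser.add_argument(
        "-o", "--output", dest="outputFilename", default="line_count.txt",
        help="file to write the line count to (default: %(default)s)",
    )
    return parser


def main(argv=None):
    # argv=None makes argparse read sys.argv[1:], as in a normal script run
    args = build_parser().parse_args(argv)
    with open(args.outputFilename, "w") as f:
        f.write(str(doSomething(args.inputFilename)))
    return args
```

Calling `main()` at the bottom of mycode.py would let you run `python mycode.py my/inputFile -o my/outputFile`, and `python mycode.py -h` would print a help message generated from the `help` strings.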

Multiple Testing: What is it, why is it bad and how can we avoid it?

P-values play a central role in the analysis of many scientific experiments. But, in 2015, the editors of the journal Basic and Applied Social Psychology prohibited the usage of p-values in their journal. The primary reason for the ban was the proliferation of results obtained by so-called ‘p-hacking’, where a researcher tests a range of different hypotheses and publishes the ones which attain statistical significance while discarding the others. In this blog post, we’ll show how this can lead to spurious results and discuss a few things you can do to avoid engaging in this nefarious practice.

The Basics: What IS a p-value?

Under a Hypothesis Testing framework, a p-value associated with a dataset is defined as the probability of observing a result that is at least as extreme as the observed one, assuming that the null hypothesis is true. If the probability of observing such an event is extremely small, we conclude that it is unlikely the null hypothesis is true and reject it.

But therein lies the problem. Just because the probability of something is small, that doesn’t make it impossible. Using the standard significance test threshold of 0.05, even if the null hypothesis is true, there is a 5% chance of obtaining a p-value below the significance threshold and therefore rejecting it. Such false positives are an inescapable part of research; there’s always a possibility that the subset you were working with isn’t representative of the global data, and sometimes we make the wrong decision even though we analysed the data in a perfectly rigorous fashion.
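To see how quickly that 5% adds up when several hypotheses are tested, here is a small sketch. It assumes the tests are independent and that every null hypothesis is true, so that each p-value is uniformly distributed on [0, 1]; the function names are made up for this example.

```python
import random


def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n_tests
    independent tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def simulate_any_false_positive(alpha, n_tests, n_experiments=20_000, seed=0):
    """Estimate the same probability by drawing p-values at random."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Assumption: under a true null hypothesis, a p-value is
        # uniformly distributed on [0, 1].
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_experiments
```

With a threshold of 0.05, `family_wise_error_rate(0.05, 1)` is just 0.05, but `family_wise_error_rate(0.05, 20)` is roughly 0.64: testing twenty true null hypotheses makes at least one “significant” result more likely than not.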


Unraveling the role of entanglement in protein misfolding

Proteins that fail to fold correctly may populate misfolded conformations with disparate structure and function. Misfolding is the focus of intense research interest due to its putative and confirmed role in various diseases, including neurodegenerative diseases such as Parkinson’s and Alzheimer’s diseases, as well as cystic fibrosis (PMID: 16689923).

Many open questions about protein misfolding remain to be answered. For example, how do misfolded proteins evade cellular quality control mechanisms like chaperones to remain soluble but non-functional for long timescales? How long do misfolded states persist on average? How widespread is misfolding? Experiments indicate that misfolding can even be caused by synonymous mutations that alter the speed of protein translation but not the sequence of the protein produced (PMID: 23417067), introducing the additional puzzle of how the protein maintains a “memory” of its translation kinetics after synthesis is complete.

A series of four recent preprints (Preprints 1, 2, 3, and 4, see below) suggests that these questions can be answered by the partitioning of proteins into long-lived self-entangled conformations that are structurally similar to the native state but with perturbed function. Simulation of the synthesis, termination, and post-translational dynamics of a large dataset of E. coli proteins suggests that misfolding and entanglement are widespread, with two thirds of proteins misfolding some of the time (Preprint 1). Many misfolded conformations may bypass proteostasis machinery to remain soluble but non-functional due to their structural similarity to the native state. Critically, entanglement is associated with particularly long-lived misfolded states based on simulated folding kinetics.

Coarse-grained and all-atom simulation results indicate that these misfolded conformations interact with chaperones like GroEL and HtpG to a similar extent as does the native state (Preprint 2). These results suggest an explanation for why some protein always fails to refold while remaining soluble, even in the presence of multiple folding chaperones – it remains trapped in entangled conformations that resemble the native state and thus fail to recruit chaperones.

Finally, simulations indicate that changes to the translation kinetics of oligoribonuclease introduced by synonymous mutations cause a large change in its probability of entanglement at the dimerization interface (Preprint 3). These entanglements localized at the interface alter its ability to dimerize even after synthesis is complete. These simulations provide a structural explanation for how translation kinetics can have a long-timescale influence on protein behavior.

Together, these preprints suggest that misfolding into entangled conformations is a widespread phenomenon that may provide a consistent explanation for many unanswered questions in molecular biology. It should be noted that entanglement is not mutually exclusive with other types of misfolding, such as domain swapping, that may contribute to misfolding in cells. Experimental validation of the existence of entangled conformations is a critical aspect of testing this hypothesis; for comparisons between simulation and experiment, see Preprint 4.

Preprint 1: https://www.biorxiv.org/content/10.1101/2021.08.18.456613v1

Preprint 2: https://www.biorxiv.org/content/10.1101/2021.08.18.456736v1

Preprint 3: https://www.biorxiv.org/content/10.1101/2021.10.26.465867v1

Preprint 4: https://www.biorxiv.org/content/10.1101/2021.08.18.456802v1

How to interact with small molecules in Jupyter Notebooks

The combination of Python and the cheminformatics toolkit RDKit has opened up so many ways to explore chemistry on a computer. Jupyter — named for the three languages, Julia, Python, and R — ties interactivity and visualization together, creating wonderful environments (Notebooks and JupyterLab) to carry out, share and reproduce research, including:

“data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.”

—https://jupyter.org

At this year’s annual RDKit UGM (User Group Meeting), Cédric Bouysset shared a tutorial explaining how to create a grid of molecules that you can interact with, using his “mols2grid” package.
