Category Archives: Data Visualization

Pitfalls of using Pearson’s correlation for comparing model performance

Pearson’s R (correlation coefficient) is a measure of the linear correlation between two variables, giving a value between -1 and 1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation. While it’s a useful statistic for understanding the relationship between two variables, it is often used to compare the performance of two or more models. For example, imagine we had experimental values that we are predicting and several models’ predictions. Obviously, we would prefer the model with the highest Pearson’s R … or perhaps not?

Continue reading

Making your figures more accessible

You might have created the most esthetic figures for your last presentation with a beautiful colour scheme, but have you considered how these might look to someone with colourblindness? Around 5% of the gerneral population suffer from some kind of color vision deficiency, so making your figures more accessible is actually quite important! There are a range of online tools that can help you create figures that look great to everyone.

Continue reading

Plotext: The Matplotlib Lookalike That Breaks Free from X Servers

Imagine this: you’ve spent days computing intricate analyses, and now it’s time to bring your findings to life with a nice plot. You fire up your cluster job, scripts hum along, and… matplotlib throws an error, demanding an X server it can’t find. Frustration sets in. What a waste of computation! What happened? You just forgot to add the -X to your ssh command, or it may be just that X forwarding is not allowed in your cluster. So you will need to rerun your scripts, once you have modified them to generate a file that you can copy to your local machine rather than plotting it directly.

But wait! Plotext to the rescue! This Python package provides an interface nearly identical to matplotlib, allowing you to seamlessly transition your plotting code without sacrificing functionality. But why choose Plotext over the familiar matplotlib? The key lies in its text-based backend. This means it is just printing characters in your console to generate the plots, making it ideal for cluster environments where X servers are often absent or restricted. What do those plots look like? Here is an example:

Continue reading

Some useful pandas functions

Pandas is one of the most used packages for data analysis in python. The library provides functionalities that allow to perfrom complex data manipulation operations in a few lines of code. However, as the number of functions provided is huge, it is impossible to keep track of all of them. More often than we’d like to admit we end up wiriting lines and lines of code only to later on discover that the same operation can be performed with a single pandas function.

To help avoiding this problem in the future, I will run through some of my favourite pandas functions and demonstrate their use on an example data set containing information of crystal structures in the PDB.

Continue reading

A simple criterion can conceal a multitude of chemical and structural sins

We’ve been investigating deep learning-based protein-ligand docking methods which often claim to be able to generate ligand binding modes within 2Å RMSD of the experimental one. We found, however, this simple criterion can conceal a multitude of chemical and structural sins…

DeepDock attempted to generate the ligand binding mode from PDB ID 1t9b
(light blue carbons, left), but gave pretzeled rings instead (white carbons, right).

Continue reading

PHinally PHunctionalising my PHigures with PHATE feat. Plotly Express.

After being recommended by a friend, I really wanted to try plotly express but I never had the inclination to read more documentation when matplotlib gives me enough grief. While experimenting with ChatGPT I finally decided to functionalise my figure making scripts. With these scripts I manage to produce figures that made people question what I had actually been doing with my time – but I promise this will be worth your time.

I have been using with dimensionality reducition techniques recently and I came across this paper by Moon et al. PHATE is a technique that represents high dimensional (ie biological) data in a way that aims to preserve connections over preserving distance and I knew I wanted to try this as soon as I saw it. Why should you care? PHATE in 3D is faster that t-SNE in 2D. It would almost be rude to not try it out.

PHATE

In my opinion PHATE (or potential of heat diffusion for affinity-based transition embedding) does have a lot going on but that the choices at each stage feel quite sensisble. It might not come as a surprise this was primarily designed to make visual inspection of data easier on the eyes.

Continue reading

An Overview of Clustering Algorithms

During the first 6 months of my DPhil, I worked on clustering antibodies and I thought I would share what I learned about these algorithms. Clustering is an unsupervised data analysis technique that groups a data set into subsets of similar data points. The main uses of clustering are in exploratory data analysis to find hidden patterns or data compression, e.g. when data points in a cluster can be treated as a group. Clustering algorithms have many applications in computational biology, such as clustering antibodies by structural similarity. Actually, this is objectively the most important application and I don’t see why anyone would use it for anything else.

There are several types of clustering algorithms that offer different advantages.

Continue reading

The Most ReLU-iable Activation Function?

The Rectified Linear Unit (ReLU) activation function was first used in 1975, but its use exploded when it was used by Nair & Hinton in their 2010 paper on Restricted Boltzmann Machines. ReLU and its derivative are fast to compute, and it has dominated deep neural networks for years. The main problem with the activation function is the so-called dead ReLU problem, where significant negative input to a neuron can cause its gradient to always be zero. To rectify this (har har), modified versions have proposed, including leaky ReLU, GeLU and SiLU, wherein the gradient for x < 0 is not always zero.

A 2020 paper by Naizat et al., which builds upon ideas set out in a 2014 Google Brain blog post seeks to explain why ReLU and its variants seem to be better in general for classification problems than sigmoidal functions such as tanh and sigmoid.

Continue reading

Quality Stats

Disclaimer – the title is a Quality Street pun only and bears no relation to the quality of the data or analysis presented below. This whole blog post is basically to discredit the personal chocolate preferences of a group member who shall remain nameless. Safe to say though, they Vostly overestimated people’s love for the Toffee Finger. Long live the Orange Creme.

Continue reading

histo.fyi: A Useful New Database of Peptide:Major Histocompatibility Complex (pMHC) Structures

pMHCs are set to become a major target class in drug discovery; unusual peptide fragments presented by MHC can be used to distinguish infected/cancerous cells from healthy cells more precisely than over-expressed biomarkers. In this blog post, I will highlight a prototype resource: Dr. Chris Thorpe’s new database of pMHC structures, histo.fyi.

histo.fyi provides a one-stop shop for data on (currently) around 1400 pMHC complexes. Similar to our dedicated databases for antibody/nanobody structures (SAbDab) and T-cell receptor (TCR) structures (STCRDab), histo.fyi will scrape the PDB on a weekly basis for any new pMHC data and process these structures in a way that facilitates their analysis.

Continue reading