No labels, no problem! A quick introduction to Gaussian Mixture Models

Statistical Modelling Big Data Analytics™ is in vogue at the moment, and there’s nothing quite so fashionable as the neural network. Capable of capturing complex non-linear relationships and scalable to high-dimensional datasets, neural networks are here to stay.

For your garden-variety neural network, you need two things: a set of features, X, and a label, Y. But what do you do if labelling is prohibitively expensive or your expert labeller goes on holiday for 2 months and all you have in the meantime is a set of features? Happily, we can still learn something about the labels, even if we might not know what they are!
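
As a rough illustration of learning something about labels without ever seeing them, here is a minimal sketch that fits a two-component, one-dimensional Gaussian mixture by expectation-maximisation. The data, the starting means and the `fit_gmm_1d` helper are all made up for this example; none of them comes from the post itself.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k      # assumed starting variance
    weights = [1.0 / k] * k    # start with equal mixing weights
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the soft labels.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(min_var, spread)  # floor avoids a collapsed component
    return weights, means, variances, resp


# Unlabelled toy features drawn by hand from two well-separated groups.
xs = [-0.3, 0.1, 0.4, -0.1, 0.2, 4.8, 5.3, 5.1, 4.7, 5.2]
weights, means, variances, resp = fit_gmm_1d(xs, means=[0.0, 5.0])

# Turn the soft responsibilities into hard "pseudo-labels".
labels = [max(range(len(r)), key=r.__getitem__) for r in resp]
```

The responsibilities in `resp` are the interesting part: they act as soft labels, and taking the most responsible component for each point gives a hard labelling of data that never had one.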

K-Means clustering made simple

The 21st century is often referred to as the age of “Big Data” due to the unprecedented increase in the volumes of data being generated. As most of this data comes without labels, making sense of it is a non-trivial task. To gain insight from unlabelled data, unsupervised machine learning algorithms have been developed and continue to be refined. These algorithms determine underlying relationships within the data by grouping data points into cluster families. The resulting clusters not only highlight associations within the data, but they are also critical for creating predictive models for new data.
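
To make the idea of grouping points into clusters concrete, here is a minimal sketch of Lloyd's algorithm, the standard k-means procedure. The toy points, the choice of initial centroids and the `kmeans` function are illustrative assumptions rather than anything taken from the post.

```python
import math


def kmeans(points, centroids, n_iter=100):
    """Lloyd's algorithm on a list of equal-length tuples.

    `centroids` supplies the starting positions, one per cluster.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Six unlabelled 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
```

The result depends on where the centroids start, which is why practical implementations usually try several initialisations and keep the best one.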

Real Space Correlation Coefficient

Introduction

In crystallography we are often faced with the question of how well a part of our model fits the data. Crystallography has well-developed probability models for the reflection amplitudes given the entire fitted model, but these do not provide a metric for “how much of the ligand is inside the blob”. This is because the reflection-based models are inherently global.
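
A local metric of this kind, the real space correlation coefficient, is usually computed as the Pearson correlation between observed and calculated electron density values over the grid points around the region of interest. Below is a minimal sketch, assuming the two densities have already been sampled at the same grid points; the density values and the `rscc` helper are invented for illustration.

```python
import math


def rscc(rho_obs, rho_calc):
    """Pearson correlation between observed and calculated density values
    sampled at the same grid points (e.g. those covering one ligand)."""
    if len(rho_obs) != len(rho_calc) or len(rho_obs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_obs = sum(rho_obs) / len(rho_obs)
    mean_calc = sum(rho_calc) / len(rho_calc)
    d_obs = [o - mean_obs for o in rho_obs]
    d_calc = [c - mean_calc for c in rho_calc]
    cov = sum(a * b for a, b in zip(d_obs, d_calc))
    norm = math.sqrt(sum(a * a for a in d_obs) * sum(b * b for b in d_calc))
    if norm == 0.0:
        raise ValueError("density is constant over the region")
    return cov / norm


# Made-up density values at five grid points covering a hypothetical ligand.
observed = [0.2, 0.9, 1.4, 0.8, 0.1]
calculated = [0.1, 1.0, 1.5, 0.7, 0.2]
score = rscc(observed, calculated)
```

Because only the grid points near the ligand enter the sums, the score reflects the local fit rather than the quality of the model as a whole.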

ICML 2020: Chemistry / Biology papers

ICML is one of the largest machine learning conferences and, like many other conferences this year, is running virtually from 12th to 18th July.

The list of accepted papers can be found here, with 1,088 papers accepted out of 4,990 submissions (22% acceptance rate). Similar to my post on NeurIPS 2019 papers, I will highlight several of potential interest to the chem-/bio-informatics communities. As before, given the large number of papers, these were selected either by “accident” (i.e. I stumbled across them in one way or another) or through a basic search (e.g. Ctrl+f “molecule”).

Uploading/downloading small files across systems

Sometimes you just want to quickly move a copy of a script, image or binary from, for example, your local (Linux) machine to another (Linux) machine. The usual tool would be scp, but this can get complicated when there are several layers of SSH, and sometimes it doesn’t work at all (as is the case for transfers between the Department of Statistics computers and the outside world).

ProCare: cavity similarity searching and its applications to fragment-based drug design

ProCare [1] is a package developed at the University of Strasbourg which aligns protein cavities and scores their similarity. The aim is to find ligand binding sites in different proteins that are similar enough to bind the same ligand. The method is designed to look particularly at fragment binding sites (a fragment being roughly ⅓ the size of a drug-like ligand), so that potential fragment hits can be predicted by comparing the cavities of the targets.

Journal Club: the Dynamics of Affinity Maturation

Last week at our group meeting I presented on a paper titled “T-cell Receptor Variable beta Domains Rigidify During Affinity Maturation” by Monica L. Fernández-Quintero, Clarissa A. Seidler and Klaus R. Liedl. The authors use metadynamics simulations of the same T-cell Receptor (TCR) at different stages of affinity maturation to study the conformational landscape of the complementarity-determining regions (CDRs), and how this might relate to an increase in affinity. Not only do they conclude that affinity maturation leads to rigidification of CDRs in solution, but they also present some evidence for the conformational selection model of biomolecular binding events in TCR-antigen interactions.

EEGor on Proteins: A Brain-based Perspective on Crowd-sourced Protein Structure Prediction

EEG-based Brain-Computer Interfaces (BCIs) are becoming increasingly popular, with products such as the Muse Headband and g.tec’s Unicorn Hybrid Black taking off. Meanwhile, in the protein folding space, Foldit and distributed/crowd computing efforts like Folding@home don’t seem to be talked about as much as they once were.

Gamification is still just as effective a tool for harnessing human ingenuity as it once was, so perhaps what is needed is a new approach to crowd-folding efforts that can tap into the full potential of the human mind to manipulate and visualise new 3D structures, by drawing inspiration directly from the minds of users…
