Category Archives: Statistics

PHinally PHunctionalising my PHigures with PHATE feat. Plotly Express.

After a friend recommended it, I really wanted to try Plotly Express, but I never had the inclination to read more documentation when matplotlib already gives me enough grief. While experimenting with ChatGPT, I finally decided to functionalise my figure-making scripts. With these scripts I managed to produce figures that made people question what I had actually been doing with my time – but I promise this will be worth your time.

I have been working with dimensionality reduction techniques recently and I came across this paper by Moon et al. PHATE is a technique that represents high-dimensional (i.e. biological) data in a way that aims to preserve connections rather than distances, and I knew I wanted to try it as soon as I saw it. Why should you care? PHATE in 3D is faster than t-SNE in 2D. It would almost be rude not to try it out.

PHATE

In my opinion, PHATE (or potential of heat diffusion for affinity-based transition embedding) does have a lot going on, but the choices at each stage feel quite sensible. It might not come as a surprise that it was primarily designed to make visual inspection of data easier on the eyes.

Continue reading

The Boltzmann Distribution and Gender Stereotypes

Journalist Caitlin Moran recently tweeted the following:

“I feel like every day now, I read/hear something saying “We don’t talk about what’s POSITIVE about masculinity; what’s GOOD about men and boys.” So: what IS the best stuff about boys, and men? Honest, celebratory question.”

What followed was a collection of replies acknowledging and celebrating various traits seen typically as ‘male’, including certain activities, such as knowing about sports or cars, or a desire to do DIY type work, and characteristics such as physical strength, no-nonsense attitudes and a ‘less complicated’ style of friendship between men.

Whilst I commend Moran’s efforts to turn recent discussions surrounding masculinity on their head and frame it in a positive light, to me the responses offered and the discussion that followed felt somewhat stifling. I am biologically male and identify as male, but do not feel like I personally adhere to most of these stereotypes. I am not physically strong, I know very little about cars and sports, and I find there to be just as much nuance and drama in male-male friendships as there is in friendships between other genders.

Continue reading

Cleaning outliers in conductance timeseries from molecular dynamics

Have you ever had an annoying dataset that looks something like this?

or even worse, just several of them

In this blog post, I will introduce basic techniques you can use and implement with Python to identify and clean outliers. The objective will be to get something more eye-pleasing (and mostly less troublesome for further data analysis) like this
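As a rough illustration of what one such basic technique can look like, here is a minimal sketch of a Hampel-style filter: each point is compared with the median of its neighbours and replaced by that median if it sits too far away. This is not necessarily the method used in the full post; the function name, the window size, the threshold and the toy trace are all assumptions made for the example.

```python
from statistics import median


def hampel_filter(values, half_window=3, n_sigmas=3.0):
    """Replace points that lie far from their local median with that median.

    Returns the cleaned series and the indices that were flagged.
    """
    k = 1.4826  # scales the MAD to a standard deviation for Gaussian data
    cleaned = list(values)
    flagged = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        local_median = median(window)
        # Median absolute deviation (MAD) of the window around point i.
        mad = median(abs(v - local_median) for v in window)
        if abs(values[i] - local_median) > n_sigmas * k * mad:
            cleaned[i] = local_median
            flagged.append(i)
    return cleaned, flagged


# Hypothetical conductance trace (arbitrary units) with one obvious spike.
trace = [1.0, 1.1, 0.9, 1.0, 25.0, 1.05, 0.95, 1.0, 1.1, 0.9]
cleaned, flagged = hampel_filter(trace)
print(flagged)
print(cleaned)
```

The median and MAD are used instead of the mean and standard deviation because the spike itself would otherwise inflate the very statistics used to detect it. One caveat of this sketch: if a window is perfectly flat (MAD of zero), any deviation at all gets flagged.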

Continue reading

Am I better? Performance metrics unravelled

What’s the deal with all these numbers? Accuracy, Precision, Recall, Sensitivity, AUC and ROCs.

The basic stuff:

Given a method that produces a numerical outcome, either categorical (classification) or continuous (regression), we want to know how well our method did. Let’s start simple:

True positives (TP): You said something was a cow and it was in fact a cow – duh.

False positives (FP): You said it was a cow and it wasn’t – sad.

True negatives (TN): You said it was not a cow and it was not – good job.

False negatives (FN): You said it was not a cow but it was a cow – do better.

I can optimise these metrics artificially. Just call everything a cow and I have a 100% true positive rate. We are usually interested in a trade-off, something like the relative value of metrics. This gives us:
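To make the trade-off concrete, here is a minimal sketch that computes the usual ratios from the four counts above. The function name and the toy numbers (a field of 10 cows and 90 non-cows) are assumptions for illustration, and the full post may define a different set of metrics.

```python
def classification_metrics(tp, fp, tn, fn):
    """Common metrics computed from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        # Fraction of all calls that were right.
        "accuracy": (tp + tn) / total,
        # Of everything we called a cow, how much really was a cow?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of all real cows, how many did we find? (true positive rate, sensitivity)
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Of all non-cows, how many did we correctly reject? (true negative rate)
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical field: 10 cows, 90 non-cows, and a method that calls everything a cow.
all_cow = classification_metrics(tp=10, fp=90, tn=0, fn=0)
for name, value in all_cow.items():
    print(f"{name}: {value:.2f}")
```

Calling everything a cow finds every cow, so recall is perfect, but precision and specificity collapse. That is why a single metric on its own says very little.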

Continue reading

5th Artificial Intelligence in Chemistry Symposium

The lineup for the Royal Society of Chemistry’s 5th “Artificial Intelligence in Chemistry” Symposium (Thursday-Friday, 1st-2nd September 2022) is now complete for both oral and poster presentations. It really is a fantastic selection of topics and speakers and it is clear this event is now a highlight of the scientific calendar. Our very own Prof. Charlotte M. Deane, MBE will be giving a keynote.

5th RSC-BMCS/RSC-CICAG Artificial Intelligence in Chemistry Symposium, 1st-2nd September, Churchill College, Cambridge + Zoom broadcast.

It marks a return to in-person meetings: it will be held at Churchill College, Cambridge, with a conference dinner at Trinity Hall.

More details are here: https://www.rscbmcs.org/events/aichem22/.

Registration for in person attendance is open until Monday 29th August 17:00 (BST).

It is also possible to register for virtual attendance; the meeting will be broadcast on Zoom.

Benford’s law and OAS

Benford’s law is an observation that in numerical data (produced by many kinds of process), the leading digit tends to be small. Wikipedia tells you that in datasets obeying Benford’s law, the number 1 appears as the leading digit about 30% of the time while 9 appears less than 5% of the time (p(n) = log10(1+1/n) where n is the leading digit). Wikipedia further lists multiple kinds of data where this tends to be true, such as electricity bills, population numbers and physical and mathematical constants, and particularly where data can be described by a power law.
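The formula is easy to check numerically. A minimal sketch (the function name is my own) that evaluates p(n) = log10(1+1/n) for each possible leading digit:

```python
import math


def benford_probability(digit):
    """Expected frequency of `digit` (1-9) as the leading digit under Benford's law."""
    return math.log10(1 + 1 / digit)


benford = {d: benford_probability(d) for d in range(1, 10)}
for digit, p in benford.items():
    print(f"{digit}: {p:.3f}")
```

The nine probabilities sum to one, because the product of the terms (1+1/n) for n from 1 to 9 telescopes to 10.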

Power laws and antibodies have been co-discussed in reference to network descriptions of antigen-experienced BCR repertoires [1], which are often described as scale-free to use the network terminology (following a power law). This means a few highly-connected nodes in the network and lots of nodes with few or no connections. This is an obvious candidate for Benford’s law.

This is of no practical relevance, but I wondered if I could see Benford’s law in other kinds of data besides clone counts in the Observed Antibody Space (OAS). For example, I looked at the leading digit in the number of sequences in all of the data units in OAS. It looks like a good fit for Benford’s law (though with more density at the smaller leading digits) and has a chi-squared value of 0.007 (Figure 1A).
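For anyone wanting to try this on their own counts, the comparison might be sketched as follows. I do not have the OAS numbers to hand here, so the powers of two below are purely a stand-in dataset, and computing the chi-squared statistic on proportions rather than raw counts is an assumption about how the figure quoted above was obtained.

```python
import math
from collections import Counter


def leading_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def benford_chi_squared(counts):
    """Chi-squared statistic between observed leading-digit proportions and Benford's law."""
    digits = Counter(leading_digit(c) for c in counts if c > 0)
    total = sum(digits.values())
    statistic = 0.0
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        observed = digits[d] / total
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Stand-in data (NOT from OAS): powers of two are known to follow Benford's law closely.
sizes = [2 ** k for k in range(1, 101)]
stat = benford_chi_squared(sizes)
print(f"chi-squared on proportions: {stat:.4f}")
```

Note that a statistic computed on proportions does not scale with sample size, so it is a descriptive measure of fit and should not be compared directly against chi-squared critical values.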

Continue reading

Why can a man not lift himself by pulling up on his bootstrap hypothesis test?

This blogpost highlights a typical mistake when performing the bootstrap hypothesis test. Bootstrapping is a method of resampling data to estimate measures of variability, such as confidence intervals or variance. 

In the simplest form of the bootstrap, assume you have a set of values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You want to estimate the mean and variability of the mean using these data. The recipe is as follows:
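In code, the standard nonparametric bootstrap for this example might look like the sketch below. The function names, the number of resamples, the seed and the percentile-based interval are my own choices and may differ in detail from the recipe in the full post.

```python
import random


def bootstrap_means(data, n_resamples=10_000, seed=0):
    """Resample `data` with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)]


def percentile_interval(samples, alpha=0.05):
    """Simple percentile confidence interval from a list of bootstrap statistics."""
    ordered = sorted(samples)
    lower = ordered[int((alpha / 2) * len(ordered))]
    upper = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lower, upper


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
means = bootstrap_means(data)
low, high = percentile_interval(means)
print(f"sample mean: {sum(data) / len(data)}")
print(f"95% percentile interval for the mean: ({low}, {high})")
```

Each resample has the same size as the original data and is drawn with replacement, so the spread of the resampled means approximates the sampling variability of the mean.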

Continue reading

Non-linear Dependence? Mutual Information to the Rescue!

We are all familiar with the idea of a correlation. In the broadest sense of the word, a correlation can refer to any kind of dependence between two variables. There are three widely used tests for correlation:

  • Pearson’s r: Used to measure a linear relationship between two variables. Requires linear dependence and each marginal distribution to be normal.
  • Spearman’s ρ: Used to measure rank correlations. Requires the dependence structure to be described by a monotonic relationship.
  • Kendall’s 𝛕: Used to measure ordinal association between variables.
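To see how the linear and rank-based measures differ in practice, here is a small pure-Python sketch. The helper names and the toy data (a cubic and a parabola) are assumptions for illustration and are not the Y1 and Y2 variables from this post. Pearson's coefficient is computed directly, and Spearman's coefficient is obtained as the Pearson correlation of the ranks.

```python
def pearson(x, y):
    """Pearson correlation: strength of the linear relationship between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """Rank the values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = list(range(-5, 6))
cubic = [v ** 3 for v in x]      # monotonic but not linear
parabola = [v ** 2 for v in x]   # fully dependent on x, but not monotonic

print("cubic:   ", pearson(x, cubic), spearman(x, cubic))
print("parabola:", pearson(x, parabola), spearman(x, parabola))
```

For the cubic, the rank-based measure picks up the perfect monotonic relationship while the linear one falls short of 1. For the parabola, both are essentially zero even though y is completely determined by x, which is exactly the kind of dependence these measures cannot see.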

While these three measures give us plenty of options to work with, they do not work in all cases. Take for example the following variables, Y1 and Y2. These might be two variables that vary in a concerted manner.

Perhaps we suspect that a state change in Y1 leads to a state change in Y2 or vice versa and we want to measure the association between these variables. Using the three measures of correlation, we get the following results:

Continue reading

Monty Python

Every now and then I decide to overthink a problem I thought I understood and get confused – last week, it was the Monty Hall problem. 

For those unfamiliar with the thought experiment, the basic premise is that you are on a game show and are presented with three doors. Behind one of the doors is a car, while behind the other two are goats. 

With zero initial information, you make a guess as to which door you think the car is behind (we assume you have enough goats already). Before looking behind your chosen door, the host opens one of the remaining two doors and reveals a goat. The host then asks you if you would like to change your guess. What should you do? 
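If you would rather not trust your intuition, the game is simple enough to simulate. Below is a minimal sketch under the standard assumptions: the host knows where the car is, always opens a goat door that is not yours, and chooses at random when two such doors are available. The function names, number of games and seed are arbitrary choices of mine.

```python
import random


def play(switch, rng):
    """Play one game and return True if the final pick hides the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither the original pick nor the opened one.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, n_games=100_000, seed=0):
    """Fraction of games won when always switching (or always staying)."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_games)) / n_games


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay:   {stay_rate:.3f}")
print(f"switch: {switch_rate:.3f}")
```

The outcome depends on the host's behaviour: if the host opened a door at random and merely happened to reveal a goat, the simulation above would no longer describe the game.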

Continue reading