Monthly Archives: May 2021

Hosting multiple Flask apps using Apache/mod_wsgi

A common way of deploying a Flask web application in a production environment is to use an Apache server with the mod_wsgi module, which allows Apache to host any application that supports Python’s Web Server Gateway Interface (WSGI), making it quick and easy to get an application up and running. In this post, we’ll go through configuring your Apache server to host multiple Python apps in a stable manner, including how to run apps in daemon mode and avoiding hanging processes due to Python C extensions not working well with Python sub-interpreters (I’m looking at you, numpy).
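To make "supports WSGI" concrete, here is a minimal sketch of the interface itself, not the configuration from the full post. A WSGI app is a callable that takes `environ` and `start_response`, and mod_wsgi by default looks for one named `application` in the script file it is pointed at. The `call_app` helper is hypothetical and exists only to exercise the app without a server.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A tiny WSGI app; a Flask app object exposes this same callable interface."""
    body = b"Hello from a WSGI app\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app, path="/"):
    """Hypothetical helper: call a WSGI app once, without Apache or any server."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    status, body = call_app(application)
    print(status, body)
```

On the Apache side, daemon mode is normally set up with the `WSGIDaemonProcess` and `WSGIProcessGroup` directives, and `WSGIApplicationGroup %{GLOBAL}` forces an app into the main interpreter rather than a sub-interpreter. Whether the full post uses exactly these settings is an assumption here.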

Continue reading →

Hidden Markov Models in Python: A simple Hidden Markov Model with Known Emission Matrix fitted with hmmlearn

The Hidden Markov Model

Consider a sensor which tells you whether it is cloudy or clear, but is wrong with some probability. Now, the weather *is* cloudy or clear, and we could go and see which it was, so there is a “true” state; but we only have noisy observations from which to attempt to infer it.

We might model this process (with the assumption of sufficiently precious weather), and attempt to make inferences about the true state of the weather over time, the rate of change of the weather and how noisy our sensor is by using a Hidden Markov Model.

The Hidden Markov Model describes a hidden Markov Chain which at each step emits an observation with a probability that depends on the current state. In general both the hidden state and the observations may be discrete or continuous.

But for simplicity’s sake let’s consider the case where both the hidden and observed spaces are discrete. Then, the Hidden Markov Model is parameterised by two matrices:
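In the standard discrete formulation these would be a transition matrix (how the hidden weather changes from step to step) and an emission matrix (how often the sensor reports each reading given the true weather). The sketch below simulates such a model for the cloudy/clear sensor. All probabilities, names and the uniform initial distribution are assumptions chosen for illustration and are not taken from the post.

```python
import random

STATES = ["clear", "cloudy"]

# Hypothetical numbers, for illustration only.
# TRANSITION[i][j] = P(next hidden state is j | current hidden state is i)
TRANSITION = [
    [0.9, 0.1],
    [0.2, 0.8],
]
# EMISSION[i][k] = P(sensor reads k | hidden state is i)
EMISSION = [
    [0.85, 0.15],
    [0.15, 0.85],
]
# Assumed uniform distribution over the first hidden state.
INITIAL = [0.5, 0.5]


def simulate(n_steps, seed=0):
    """Sample a hidden weather sequence and the noisy sensor readings it emits."""
    rng = random.Random(seed)
    hidden, observed = [], []
    state = rng.choices([0, 1], weights=INITIAL)[0]
    for _ in range(n_steps):
        hidden.append(state)
        # The observation depends only on the current hidden state.
        observed.append(rng.choices([0, 1], weights=EMISSION[state])[0])
        # The next hidden state depends only on the current one (Markov property).
        state = rng.choices([0, 1], weights=TRANSITION[state])[0]
    return hidden, observed


if __name__ == "__main__":
    hidden, observed = simulate(10)
    print("true:  ", [STATES[s] for s in hidden])
    print("sensor:", [STATES[o] for o in observed])
```

Fitting goes the other way: given only the sensor readings, estimate the hidden sequence and the unknown matrix entries.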

Continue reading →

CAML: Courses in Applied Machine Learning

*Shameless self-promotion klaxon!! Have a look at my new website!*

I’m excited to share a project I’ve been working on for the past few months! One of the biggest challenges of working on an interdisciplinary research project is getting to grips with the core principles of the disciplines which you don’t have much formal training in. For me, that means learning the basics of Medicinal Chemistry and Structural Biology so that when someone mentions pi-stacking I don’t think they’re talking about the logistics of managing a bakery; for people coming from Bio/Chem backgrounds it can mean understanding the Maths and Statistics necessary to make sense of the different algorithms which are central to their work.

Continue reading →

Can few-shot language models perform bioinformatics tasks?

In 2019, I tried my hand at using large language models, specifically GPT-2, for text generation. In that blogpost, I used Hansard files to fine-tune the public release of GPT-2 to generate speeches by several speakers in the House of Commons (link).

In 2020, OpenAI released GPT-3, their new and improved text generation model (paper), which uses a whopping 175 billion parameters (as opposed to its predecessor’s 1.5 billion) and not only proved to be capable of state of the art performance on common text prediction benchmarks, but also generated a considerable amount of interest in the news media:

Continue reading →

Code that I am grateful for

To address some of the karmic imbalance created by computational scientists complaining about other people’s code, I am listing here some (not all) of other people’s code that I love.

IgBLAST

IgBLAST is a sequence alignment tool for immunoglobulin sequences implemented in the NCBI C++ toolkit – it applies the classic BLAST algorithm to searching immunoglobulin germline gene databases. It always impresses me how quickly it works. The paper is here, and the authors are Jian Ye, Ning Ma, Thomas L. Madden and James M. Ostell.

Continue reading →

Do antibodies care about sex?

In a recent OPIG antibody meeting, the topic of immune system differences between men and women came up. I thought this was cool and something I hadn’t read about, so what a brilliant topic for a blog post. This post is a high-level overview – I’ve listed the papers I’ve used at the bottom of this post so please consult them for more details!

Differences between males and females can lead to pretty big disparities in disease prevalence and outcomes. For example, non-reproductive cancers occur predominantly in males, whilst the majority of autoimmune disease occurs in females. Many factors may be impacting this, including environmental, genetic and hormonal influences, and much more research is required to fully understand these processes. Here I focus on sex-based biology, rather than gender, though both can influence the immune response.

Continue reading →

Bioinformatics Hackathon Reflection

A week ago I participated in Copenhagen Bioinformatics Hackathon 2021, a hackathon focusing on machine learning and proteins, as a mentor for a challenge proposed by our group. The whole experience was fun, but I am also sitting here contemplating a lot of things I wish I had done differently. For this blog post, I therefore want to highlight two changes which I believe would have greatly improved my challenge and which can hopefully also work as an inspiration for others presenting a hackathon challenge.

Going into this event I had some experience from a few hackathons I had previously attended. Based on this, I wanted to create a challenge containing two parts. First, a simple task which everyone would be able to create a solution for, and second, a more challenging addition to the first task for more experienced participants. I decided to go with the challenge of predicting which heavy and light chains can form a pair, where the additional challenge was to try to visualize which residues were relevant for this interaction. Together with OAS containing a really nice positive dataset of paired chains, I thought this was going to be an amazing challenge, but as soon as the event began I started seeing the flaws of the challenge.

Continue reading →

6 things I’ve learnt in my first year as a PhD student

Despite spending only four weeks working in the department, this month roughly marks a year since I started my unlikely career as a statistician and was inaugurated into the hall of opiglets (if you account for my foray into the magic of quantum computing last summer). The past year has been filled with learning opportunities, some of which I ought to take note of, and others which are probably worth forgetting. Nonetheless, here is a short list of things I’ve learned in my first year as a DPhil Student, which you may find helpful in what I hope are more precedented times.

Simple and stupid first

When it comes to deciding how to tackle your next scientific problem or which lesson to start your blog post with, often the simplest and sometimes most ‘stupid’ idea is the way to go. Keeping things simple gives you the time to better understand your question without getting lost in the details of a complex solution. Plus, the results will inform your next steps.

Continue reading →

Singularity: a guide for the bewildered bioinformatician

Have you ever worked with a piece of software that is awfully difficult to set up? That legacy code written in FORTRAN 77, that other one that requires significant modifications to compile, or any of those that require a long-winded bash script with a thousand dependencies (which you also have to install!). Would it not be helpful if, when that red-eyed PhD student, the one that just spent three months writing up their thesis, says that they absolutely must use that server where you have installed all your stuff, you could just relocate to another one without trouble? Well, you may be able to do that now. You just need to use containerization.

The idea behind containerization is rather simple. The best way to ensure anyone can reproduce your work is to, well, ship your entire system to whoever needs to use it. You could, for example, pack up your desktop in a box, and ship it to your collaborators anywhere in the world. Unfortunately, this idea is quite impractical, not only because of tedious logistics (ever had to deal with customs?), but also because suddenly you won’t be able to run your own pipeline. However, it is a good enough thought that at some point made a clever engineer wonder whether there was a way to ship an entire system without physically delivering the computer. And that’s exactly what they designed.

Best way to make sure your collaborators on the other side of the world can run your pipeline: just pack your desktop in one of these, and ship it away!

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
