Monthly Archives: September 2021

Watch out when using PDBbind!

Now that PDBbind 2020 has been released, I want to draw some attention to an issue with using the SDF files that are supplied in the PDBbind refined set 2020.

Normally, SDF files save the chirality information of compounds in the atom block of the file, which is shown below as a snippet of the full SDF file for the ligand of PDB entry 4qsv. The column that defines chirality is marked in red.

As you can see, all columns shown here are 0. The SDF files supplied by PDBbind for some reason do NOT encode chirality information explicitly. This will be a problem when using RDKit to read the molecule and transform it into a SMILES string. By using the following commands to read the ligand for 4qsv from PDBbind 2020 and write a SMILES string, we get:

Continue reading →

Chained or Unchained: Markov, Nekrasov and Free Will

A Markov Chain moving between two states A and B. Animation by Devin Soni

Markov chains are simple probabilistic models of sequences of related events through time. In a Markov chain, events at the present time depend only on the previous event in the sequence. The example above shows a model of a dynamical system with two states, A and B, where the events are either moving between states A and B or staying put.
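A two-state chain like this is easy to simulate. The sketch below is only an illustration: the transition probabilities in `TRANSITIONS` are made-up values (none are given here), and `simulate_chain` is a hypothetical helper name. The point is that the next state is drawn using only the current state, never the earlier history.

```python
import random

# Hypothetical transition probabilities, chosen only for illustration.
# TRANSITIONS[current][next] is the probability of moving from current to next.
TRANSITIONS = {
    "A": {"A": 0.6, "B": 0.4},
    "B": {"A": 0.3, "B": 0.7},
}


def simulate_chain(start, n_steps, rng):
    """Return the list of visited states, beginning with `start`."""
    states = [start]
    for _ in range(n_steps):
        # Only the most recent state is consulted: this is the Markov property.
        options = TRANSITIONS[states[-1]]
        next_state = rng.choices(list(options), weights=list(options.values()))[0]
        states.append(next_state)
    return states


if __name__ == "__main__":
    print("".join(simulate_chain("A", 20, random.Random(0))))
```

Passing in a seeded `random.Random` keeps a run reproducible, which is handy when comparing chains with different transition probabilities.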

More formally, a Markov chain is a model of any sequence of events with the following relationship

$P(X_t=x \mid X_{t-1}=x_{t-1},X_{t-2}=x_{t-2},\dots,X_1=x_1)=P(X_t=x \mid X_{t-1}=x_{t-1})$ .

That is, the event that the sequence $\{X_t\}_{t}$ is in state $x$ at time $t$ is conditionally independent of all of its past states given its immediate past. This simple relationship between past and present provides a useful simplifying assumption to model, to a surprising degree of accuracy, many real world systems. These range from air particles diffusing through a room, to the migration patterns of insects, to the evolution of your genome, and even your web browser activity. Given their broad use in describing natural phenomena, it is very curious that Markov first invented the Markov chain to settle a dispute in Mathematical Theology, one in which the atheist Markov was pitted against the devoutly Orthodox Pavel Nekrasov.

Continue reading →

Why all academics should be on TikTok

Recently I have had the opportunity to get a closer look at the submission, review and promotion cycle for a typical academic paper. It was a great learning experience and led to an increase in the number of research papers, news articles, and reviews I read in preparation. However, on multiple occasions, I did think “I wish I could watch a 2 min video to explain this”. That got me thinking: why couldn’t I, and should I be able to?

Continue reading →

Being Brief.

This is a blog post about using fewer words.

Continue reading →

Using Singularity on Windows with WSL2

Previously on this blog, my colleagues Carlos and Eoin have extolled the many virtues of Singularity, which I will not repeat here. Instead, I’d like to talk about a rather interesting subject that was unexpectedly thrust upon me when my faithful Linux laptop started to show the early warning signs of critical existence failure: is there a good way to run a Singularity container on a pure Windows machine? It turns out that, with version 2 of the Windows Subsystem for Linux (WSL), there is.

Continue reading →

Lessons in Scientific Code Deployment

So, I recently deployed my first piece of scientific code. Well, sort of. I made a GitHub repository with instructions on how to download, install and run it.

And then everyone broke it.

So, now having been on tech support duty for a few weeks, it seemed like a good idea to have a think about what I’ve learned.

Now, there is a big preface to this: the first and most important thing I learned is that I should do some reading on how to do this well. I have not yet done that reading, so this post isn’t so much going to offer any advice as catalogue my mistakes. Mistakes that will probably look extremely silly to anyone who has any familiarity with deployment, but might be interesting to anyone who doesn’t.

A surprising number of people really don’t want to touch the command line

As a programmer who spends the vast majority of their time on the command line, I find invoking programs from there very natural. As such, I very much underestimated how much of an obstacle even installing Anaconda and a few packages, and cloning the source code, would be. Even with instructions to copy and paste.

The issue is, if anything goes wrong, there is a good chance they don’t know whether it is my code or their environment breaking, which probably means they need to contact me about it (more on environments later).
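One thing that might shorten these support conversations is a small report users can run and paste back. The sketch below is a hypothetical helper, not something the code currently ships with: `environment_report` and the module names passed to it are illustrative, and it uses only the standard library to record which interpreter is running and which required modules cannot be found.

```python
import importlib.util
import platform
import sys


def environment_report(required_modules):
    """Collect basic facts about the running Python environment."""
    # find_spec returns None when a top-level module cannot be located.
    missing = [
        name for name in required_modules
        if importlib.util.find_spec(name) is None
    ]
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        "missing_modules": missing,
    }


if __name__ == "__main__":
    # The module list is a placeholder; a real tool would list its own dependencies.
    for key, value in environment_report(["numpy", "scipy"]).items():
        print(f"{key}: {value}")
```

If `missing_modules` is non-empty or `executable` points outside the expected environment, the problem is probably the environment rather than the code.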

Really, I probably could have saved myself an awful lot of support by making it installable, and more still with a GUI to guide people through using the program.

Python is a pain

So, the first thing I learned was something I’d kind of been warned about: deploying Python code is a pain in the butt. Especially for people who aren’t familiar with Python, managing Python environments is both tricky and overwhelmingly easy to break code with. Run a Python script from the wrong environment and it is going to fail: if you are lucky, with a failure to import a module; if you are unlucky, with a cryptic error due to, say, changes between Python versions.
Speaking of Python versions, developing in 3.9 without testing in 3.7, and then telling people to install 3.7, can result in a surprising number of surprisingly difficult bugs.
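One cheap guard against this is to check the interpreter version at start-up and fail with a readable message rather than a cryptic one. This is a minimal sketch: `MINIMUM_PYTHON` is assumed to be the version the code was developed on (3.9 here), and `check_python_version` is a hypothetical name.

```python
import sys

# Assumption: the oldest Python version the code has actually been tested on.
MINIMUM_PYTHON = (3, 9)


def check_python_version(current=None, minimum=MINIMUM_PYTHON):
    """Raise a clear error if the interpreter is older than `minimum`."""
    if current is None:
        current = tuple(sys.version_info[:2])
    if tuple(current[:2]) < tuple(minimum):
        raise RuntimeError(
            "This program needs Python {}.{} or newer, but is running on {}.{} ({}).".format(
                minimum[0], minimum[1], current[0], current[1], sys.executable
            )
        )


if __name__ == "__main__":
    check_python_version()
    print("Python version OK")
```

Including `sys.executable` in the message also shows which environment the script was launched from.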

The instructions weren’t clear enough

Scientific code I think generally caters an awful lot to expert users, people who really understand the model and even are willing to open the source code to figure out the implementation.

My first stab at documentation managed to not be clear enough for either group: the people who didn’t want to touch the command line, and those who were willing to open the source code because they wanted to do something spicy.

So yeah, good documentation is an acquired skill.

Distributed computing is a nightmare

In principle, distribution is terrific: get a library that will allow you to reduce running arbitrary python code on multiple nodes to a simple map-like interface. On big clusters, like a lot of scientists use, this can mean speed ups from 10 to even 1000 times.
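To show what a map-like interface looks like, here is a local stand-in using the standard library's `concurrent.futures`. It runs on one machine with threads rather than across cluster nodes, and `simulate` is a hypothetical placeholder for one independent unit of work. Cluster libraries offer a call of a similar shape (for example, the `map` method on a Dask distributed `Client`), but their configuration is where things get site-specific.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(parameter):
    """Hypothetical stand-in for one independent piece of scientific work."""
    return parameter ** 2


def run_all(parameters, max_workers=4):
    """Apply `simulate` to every parameter; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map has the same shape as the built-in map(func, iterable).
        return list(executor.map(simulate, parameters))


if __name__ == "__main__":
    print(run_all(range(5)))
```

The appeal is that the calling code barely changes when the executor is swapped for a distributed one; the difficulty described next lies in everything around that call.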

The only problem is, everyone’s cluster is a special snowflake, and you can’t access most of them to fix things. This can make iteration with a non-programmer painfully slow.

Libraries don’t help as much as I’d have thought either: indeed, my experience of Dask and Dask Jobqueue has been a consistently uphill battle. Between my workload wanting individual nodes with lots of shared memory and only a few CPUs, and some truly arcane errors (one of which broke inside the msgpack code), I have seriously considered (and even started) writing my own code to do this.

Active development doesn’t reach people

Code that is being updated several times a day in response to bugfixes can be great – but if people aren’t pulling and installing it, no-one is going to benefit. I’m seriously tempted to write some code to either auto-update on running or at least let folk know it has been updated.
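A notification of this kind could be as simple as asking git how far the local checkout is behind its upstream branch. The sketch below is one possible approach, not something already implemented: it assumes the code was installed by cloning with git and that the branch tracks a remote. `commits_behind` and `update_notice` are hypothetical names, and `run_git` can be replaced to try the logic without a repository.

```python
import subprocess


def commits_behind(repo_dir, run_git=None):
    """Return how many commits the checkout in `repo_dir` is behind its upstream."""
    if run_git is None:
        def run_git(*args):
            result = subprocess.run(
                ["git", "-C", str(repo_dir), *args],
                capture_output=True, text=True, check=True,
            )
            return result.stdout.strip()
    # Refresh remote-tracking branches, then count commits only upstream has.
    run_git("fetch", "--quiet")
    return int(run_git("rev-list", "--count", "HEAD..@{u}"))


def update_notice(behind):
    """Return a message for the user, or None if the checkout is up to date."""
    if behind <= 0:
        return None
    return (
        f"This copy is {behind} commit(s) behind the latest version; "
        "run 'git pull' to update."
    )
```

Printing the notice at start-up is less intrusive than auto-updating, and it avoids changing code underneath a running analysis.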

Summary

In summary, a lot went wrong in my first stab at this. I have very much come to appreciate that good deployment is an art form, and I’ve got an awful lot of reading to do. In particular, the above problem areas really have eaten a lot of time that probably could have been spent doing actual science with the code, so there is a good incentive to get it right.

New search features for the Structural Antibody Database (SAbDab)

Since its original publication in 2013, we have added several advanced search features to the Structural Antibody Database. This post aims to give an overview of some of these features.

Continue reading →

Antibodies for gut or bad

Over the last two decades, there has been mounting evidence of the role of the gut microbiome (the collection of microorganisms in the GI tract) in metabolic disorder (Fan and Pedersen 2021) and more recently, in psychiatric illness (Morais, Schreiber, and Mazmanian 2021). The maintenance of the equilibrium of commensal bacteria and their proper compartmentalization and stratification in the gut is critical for health.

There are diverse factors regulating microbiota composition (microbiota homeostasis) (Macpherson and McCoy 2013). I am principally interested in the role of antibodies – the idea that antibodies participate in this process is controversial (Kubinak and Round 2016) because of the difficulty of controlling for the multiple confounding environmental variables that influence the microbiome, but there are theories as to how this happens. The process by which antibodies shape the microbiota was dubbed “antibody-mediated immunoselection” (AMIS) by Kubinak and Round (2016).

Continue reading →

Oxford Protein Informatics Group

or "OPIG" to friends
