Category Archives: Machine Learning

Lucubration or Gaslighting?

Or: The best lies have a nugget of truth in them.

Lucubration – The action or occupation of intensive study originally by candle or lamplight.

Gaslighting – Psychological abuse in which a person or group causes someone to question their own sanity, memories, or perception.

I was recently having a play with Google Bard. Bard, unlike ChatGPT, has access to live data. It also undergoes live feedback and quality control. I was hoping to see if it would find me any journals with articles on prion research which I’d previously overlooked.

Me: Please show me some recent articles about prion research.
(Because always be polite to our AI overlords, they’ll remember!)

Continue reading

What can you do with the OPIG Immunoinformatics Suite? v3.0

OPIG’s growing immunoinformatics team continues to develop and openly distribute a wide variety of databases and software packages for antibody/nanobody/T-cell receptor analysis. Below is a summary of all the latest updates (follows on from v1.0 and v2.0).

Continue reading

PHinally PHunctionalising my PHigures with PHATE feat. Plotly Express.

A friend recommended Plotly Express and I really wanted to try it, but I never had the inclination to read more documentation when matplotlib gives me enough grief. While experimenting with ChatGPT I finally decided to functionalise my figure-making scripts. With these scripts I managed to produce figures that made people question what I had actually been doing with my time, but I promise this will be worth yours.

I have been using dimensionality reduction techniques recently and I came across this paper by Moon et al. PHATE is a technique that represents high-dimensional (i.e. biological) data in a way that aims to preserve connections rather than distances, and I knew I wanted to try it as soon as I saw it. Why should you care? PHATE in 3D is faster than t-SNE in 2D. It would almost be rude not to try it out.

PHATE

In my opinion PHATE (potential of heat diffusion for affinity-based transition embedding) does have a lot going on, but the choices at each stage feel quite sensible. It might not come as a surprise that it was primarily designed to make visual inspection of data easier on the eyes.

Continue reading

9th Joint Sheffield Conference on Cheminformatics

Over the next few days, researchers from around the world will be gathering in Sheffield for the 9th Joint Sheffield Conference on Cheminformatics. As one of the organizers (wearing my Molecular Graphics and Modeling Society ‘hat’), I can say we have an exciting array of speakers and sessions:

  • De Novo Design
  • Open Science
  • Chemical Space
  • Physics-based Modelling
  • Machine Learning
  • Property Prediction
  • Virtual Screening
  • Case Studies
  • Molecular Representations

It has traditionally taken place every three years, but despite the global pandemic it is returning this year, once again in person in the excellent conference facilities at The Edge. You can download the full programme in iCal format, and here is the conference calendar:

Continue reading

Machine learning strategies to overcome limited data availability

Machine learning (ML) for biological/biomedical applications is very challenging – in large part due to limitations in publicly available data (something we recently published about [1]). Substantial amounts of time and resources may be required to generate the types of data (e.g. protein structures, protein-protein binding affinity, microscopy images, gene expression values) required to train ML models.

In cases where there is sufficient data available to provide signal, but not enough for the desired performance, ML strategies can be employed:

Continue reading

Train Your Own Protein Language Model In Just a Few Lines of Code

Language models have token the world by storm recently and, given the already explored analogies between protein primary sequence and text, there’s been a lot of interest in applying these models to protein sequences. Interest is not only coming from academia and the pharmaceutical industry, but also some very unlikely suspects such as ByteDance – yes the same ByteDance of TikTok fame. So if you also fancy trying your hand at building a protein language model then read on, it’s surprisingly easy.

Training your own protein language model from scratch is made remarkably easy by the HuggingFace Transformers library, which allows you to specify a model architecture, tokenise your training data, and train a model in only a few lines of code. Under the hood, the Transformers library uses PyTorch (or optionally TensorFlow) models, allowing you to dig deeper into customising training or model architecture, or simply leave it to the highly abstracted Transformers library to handle it all for you.

For this article, I’ll assume you already understand how language models work, and are now looking to implement one yourself, trained from scratch.
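To make the tokenisation step concrete, here is a minimal sketch of a residue-level tokeniser in plain Python. It is not the Transformers API and not necessarily what the full post uses: the vocabulary, the special tokens and the `encode` helper are all assumptions made for illustration.

```python
# Hypothetical residue-level vocabulary: 4 special tokens + 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def encode(sequence, max_length=None):
    """Map a protein sequence to token ids, one token per residue.

    Unknown residues (e.g. 'X') map to <unk>. If max_length is given,
    the output is right-padded with <pad>; sequences that are already
    longer are returned unpadded.
    """
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get(res, VOCAB["<unk>"]) for res in sequence.upper()]
    ids.append(VOCAB["<eos>"])
    if max_length is not None:
        ids += [VOCAB["<pad>"]] * max(0, max_length - len(ids))
    return ids


print(encode("MKV"))                # [1, 14, 12, 21, 2]
print(encode("MKV", max_length=7))  # [1, 14, 12, 21, 2, 0, 0]
```

A tokeniser from the Transformers ecosystem would play this role in practice; the point here is only that each residue becomes one integer id before it reaches the model.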

Continue reading

The State of Computational Protein Design

Last month, I had the privilege to attend the Keystone Symposium on Computational Design and Modeling of Biomolecules in beautiful Banff, Canada. This conference gave an incredible insight into the current state of the protein design field, as we are on the precipice of advances catalyzed by deep learning.

Here are my key takeaways from the conference:

Continue reading

Be a computational chemist and you must be a jack of all trades

Being a jack of all trades brings to mind someone who has extensive multidisciplinary expertise and is equipped with many tools in their toolbox to solve different problems. A jack of all trades is a great succinct description for computational chemists in drug discovery.

Recently I had a great conversation with Dr. Arjun Narayanan, a Senior Research Scientist at Vertex Pharmaceuticals and a jack of all trades as a computational chemist. In this blog post, I’ll describe what he does as a computational chemist, the problems he solves, and the new tools he’s looking forward to adding to his toolbox.

Continue reading

An Overview of Clustering Algorithms

During the first 6 months of my DPhil, I worked on clustering antibodies and I thought I would share what I learned about these algorithms. Clustering is an unsupervised data analysis technique that groups a data set into subsets of similar data points. The main uses of clustering are in exploratory data analysis to find hidden patterns or data compression, e.g. when data points in a cluster can be treated as a group. Clustering algorithms have many applications in computational biology, such as clustering antibodies by structural similarity. Actually, this is objectively the most important application and I don’t see why anyone would use it for anything else.

There are several types of clustering algorithms that offer different advantages.
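As a small illustration of grouping similar data points, here is a sketch of k-means (Lloyd's algorithm) in plain Python. k-means is used only because it is a familiar example; this excerpt does not say which algorithms the full post covers. The toy points and the choice to start the centroids at the first `k` points are assumptions made to keep the example deterministic.

```python
from math import dist


def k_means(points, k, n_iter=100):
    """Minimal Lloyd's algorithm; centroids start at the first k points."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # Assignments stopped changing, so we have converged.
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy 2-D data with two obvious groups, near (0, 0) and near (5, 5).
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, centroids = k_means(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

For real data such as antibodies, the Euclidean `dist` would be swapped for a domain-specific distance, for example a structural similarity measure.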

Continue reading

Molecular conformation generation with a DL-based force field

Deep learning (DL) methods in structural modelling are outcompeting force fields because they overcome the two main limitations of force field methods – the prohibitively large search space for large systems and the limited accuracy of the description of the physics [4].

However, the two methods are also compatible. DL methods are helping to close the gap between the applications of force fields and ab initio methods [3]. The advantage of DL-based force fields is that the functional form does not have to be specified explicitly and that they can be much more accurate. Say goodbye to the 12-6 potential function.

In principle DL-based force fields can be applied anywhere that regular force fields have been applied, for example conformation generation [2]. The flip side of DL-based methods is commonly poor generalization, but it seems that DL-based force fields, when properly trained, generalize well. ANI trained on molecules with up to 8 heavy atoms is able to generalize to molecules with up to 54 atoms [1]. Excitingly for my research, ANI-2 [2] can replace UFF or MMFF as the energy minimization step for conformation generation in RDKit [5].
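For reference, the 12-6 (Lennard-Jones) potential mentioned above has the fixed functional form V(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The sketch below evaluates it in reduced units; the function name and the default values ε = σ = 1 are illustrative assumptions, not parameters from any particular force field.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """12-6 Lennard-Jones energy for two particles separated by distance r.

    epsilon is the well depth and sigma the distance at which the energy
    crosses zero; both default to 1 (reduced units).
    """
    sr6 = (sigma / r) ** 6
    # sr6 ** 2 is the repulsive r^-12 term, sr6 the attractive r^-6 term.
    return 4.0 * epsilon * (sr6 ** 2 - sr6)


# The minimum of the well sits at r = 2^(1/6) * sigma, with energy -epsilon.
r_min = 2 ** (1 / 6)
print(lennard_jones(1.0))    # 0.0 at r = sigma
print(lennard_jones(r_min))  # approximately -1.0
```

This hand-chosen form is what a DL-based force field avoids: the network learns the energy surface from reference data without an explicit functional form being specified.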

So let’s use Auto3D [2] to generate low-energy conformations for four molecules: caffeine, ibuprofen, an experimental hybrid peptide, and imatinib:

CN1C=NC2=C1C(=O)N(C(=O)N2C)C CFF
CC(C)Cc1ccc(cc1)C(C)C(O)=O IBP
Cc1ccccc1CNC(=O)[C@@H]2C(SCN2C(=O)[C@H]([C@H](Cc3ccccc3)NC(=O)c4cccc(c4C)O)O)(C)C JE2
Cc1ccc(cc1Nc2nccc(n2)c3cccnc3)NC(=O)c4ccc(cc4)CN5CCN(CC5)C STI

Continue reading