At group meeting a few weeks ago I presented this paper, “Landscape of Non-canonical Cysteines in Human VH Repertoire Revealed by Immunogenetic Analysis”, from Prabakaran and Chowdhury. The paper is an investigation of the frequency, location and patterns of cysteines contained in human antibody sequences. Cysteines are important amino acids found in proteins, including antibodies, which can form disulphide bonds with other cysteines due to the presence of their reactive sulfhydryl group in the side chain.
Category Archives: Journal Club
Learning from Biased Datasets
Both the beauty and the downfall of learning-based methods is that the data used for training will largely determine the quality of any model or system.
While there have been numerous algorithmic advances in recent years, the most successful applications of machine learning have been in areas where either (i) you can generate your own data in a fully understood environment (e.g. AlphaGo/AlphaZero), or (ii) data is so abundant that you’re essentially training on “everything” (e.g. GPT-2/3, CNNs trained on ImageNet).
This covers only a narrow range of applications, with most data not falling into one of these two categories. Unfortunately, when this is true (and even sometimes when you are in one of those rare cases) your data is almost certainly biased – you just may or may not know it.
Journal Club: the Dynamics of Affinity Maturation
Last week at our group meeting I presented on a paper titled “T-cell Receptor Variable beta Domains Rigidify During Affinity Maturation” by Monica L. Fernández-Quintero, Clarissa A. Seidler and Klaus R. Liedl. The authors use metadynamics simulations of the same T-cell Receptor (TCR) at different stages of affinity maturation to study the conformational landscape of the complementarity-determining regions (CDRs), and how this might relate to an increase in affinity. Not only do they conclude that affinity maturation leads to rigidification of CDRs in solution, but they also present some evidence for the conformational selection model of biomolecular binding events in TCR-antigen interactions.
The Coronavirus Antibody Database (CoV-AbDab)
We are happy to announce the release of CoV-AbDab, our database tracking all coronavirus-binding antibodies and nanobodies with molecular-level metadata. The database can be searched and downloaded here: http://opig.stats.ox.ac.uk/webapps/coronavirus
Journal Club: Is our data biased, and should it be?

Last week I presented the above paper at group meeting. While a little different from a typical OPIG journal club paper, the data we have access to almost certainly suffers from the same range of (possible) biases explored in this paper.
Learning dynamical information from static protein and sequencing data
I would like to advertise the research from Pearce et al. (https://doi.org/10.1101/401067), whose talk I attended at ISMB 2019. The talk was titled ‘Learning dynamical information from static protein and sequencing data’. I became interested because my field of research is structural biology, which deals with dynamical systems, e.g. proteins, while the data is often static, e.g. structures from X-ray crystallography. They presented a general protocol to infer transition rates between states in a dynamical system that can be represented with an energy landscape.
Journal Club: Investigating Allostery with a lot of Crystals
Keedy et al. 2018: An expanded allosteric network in PTP1B by multitemperature crystallography, fragment screening, and covalent tethering.
Allostery is defined as a conformational/activity change of an active site due to a binding event at a distant (allosteric) site.
The paper I presented in the journal club tried to decipher the underlying mechanics of allostery in PTP1B. PTP1B is a protein tyrosine phosphatase (phosphatases being the counterparts of kinases) and a validated drug target. Allosteric binding sites are known, but so far neither active-site nor allosteric-site inhibitors have reached clinical use. Thus, an improved mechanistic understanding could improve drug discovery efforts.
Kernel Methods are a Hot Topic in Network Feature Analysis
The kernel trick is a well-known method in machine learning for producing a real-valued measure of similarity between data points in any number of settings. Kernel methods for network analysis provide a way of assigning real values to pairs of vertices of a graph. These values may correspond to similarity across any number of graphical properties, such as the neighbours two vertices share or, in a more dynamic context, the influence that a change in the state of one vertex might have on another.
By using the kernel trick it is possible to approximate the distribution of features on the vertices of a graph in a way that respects the graphical relationships between vertices. Kernel-based methods have long been used, for instance, to infer protein function from other proteins within Protein Interaction Networks (PINs).
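To make the idea concrete, here is a minimal sketch of one common graph kernel, the diffusion (heat) kernel K = exp(-βL), where L is the graph Laplacian. The choice of this particular kernel, the toy path graph, the value of `beta` and the function name `diffusion_kernel` are all assumptions made for illustration; they are not taken from any specific paper discussed here.

```python
import numpy as np

def diffusion_kernel(adjacency, beta=0.5):
    """Heat (diffusion) kernel K = exp(-beta * L) for an undirected graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    # L is symmetric, so the matrix exponential can be computed
    # from its eigendecomposition: exp(-beta L) = V exp(-beta Λ) V^T.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs @ np.diag(np.exp(-beta * eigvals)) @ eigvecs.T

# Toy path graph 0 - 1 - 2 - 3 (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
K = diffusion_kernel(A, beta=0.5)

# Spread a known annotation on vertex 0 to the other vertices:
# vertices closer to 0 in the graph receive a higher score.
labels = np.array([1.0, 0.0, 0.0, 0.0])
scores = K @ labels
print(np.round(scores, 3))
```

In a PIN setting, `labels` could mark proteins with a known function and `scores` would then rank unannotated proteins by their kernel-weighted proximity to them; larger `beta` lets the annotation diffuse further through the network.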
Check My Blob
A brief overview and discussion of: Automatic recognition of ligands in electron density by machine learning. This paper aims to reduce the bias of crystallographers fitting ligands into electron density for protein-ligand complexes. The authors train a supervised machine learning model using known ligand sites across the whole Protein Data Bank, to produce a classifier that can identify which common ligands could fit a given region of electron density.
Mol2vec: Finding Chemical Meaning in 300 Dimensions

2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities. Magnitudes reflect importance, i.e. more meaningful words. [Figure from Ref. 1]
A recent publication of one of my former InhibOx-colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec” [1]. They also released a Python implementation, available on Samo Turk’s GitHub repository.
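The summation step described in the figure caption can be sketched in a few lines. This is not the Mol2vec package's API: the substructure names, the random 300-dimensional vectors standing in for trained embeddings, the zero vector used for unseen substructures, and the helper names `molecule_vector` and `cosine_similarity` are all hypothetical, chosen only to show how substructure vectors add up to a molecule vector.

```python
import numpy as np

DIM = 300  # embedding dimensionality, as in the post title
rng = np.random.default_rng(0)

# Hypothetical substructure "words"; in practice these would be
# Morgan substructure identifiers with trained embedding vectors.
vocab = ["sub_a", "sub_b", "sub_c", "sub_d"]
embedding = {name: rng.normal(size=DIM) for name in vocab}

def molecule_vector(substructures, embedding, dim=DIM):
    """Sum the vectors of all substructures found in a molecule."""
    total = np.zeros(dim)
    for name in substructures:
        # Assumption: unseen substructures contribute a zero vector.
        total += embedding.get(name, np.zeros(dim))
    return total

def cosine_similarity(u, v):
    """Cosine of the angle between two molecule vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

mol_1 = molecule_vector(["sub_a", "sub_b", "sub_c"], embedding)
mol_2 = molecule_vector(["sub_a", "sub_b", "sub_d"], embedding)
print(cosine_similarity(mol_1, mol_2))
```

Because the molecule vector is a plain sum, two molecules that share most of their substructures end up pointing in similar directions, which is what the 2D projection of the amino acids visualises.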
