So, you are interested in compound selectivity and machine learning papers?

At the last OPIG meeting, I gave a talk about compound selectivity and machine learning approaches to predict whether a compound might be selective. As promised, here is a list of publications I would hand to a beginner in the field of compound selectivity and machine learning.

Concept of compound selectivity (based on structural or ligand features)? Read these publications!

  • Haupt, V. J., Daminelli, S. & Schroeder, M. Drug Promiscuity in PDB: Protein Binding Site Similarity Is Key. PLoS One 8, 1–15 (2013).
  • Sturm, N., Desaphy, J., Quinn, R. J., Rognan, D. & Kellenberger, E. Structural insights into the molecular basis of the ligand promiscuity. J. Chem. Inf. Model. 52, 2410–2421 (2012).
  • Hu, Y. & Bajorath, J. Compound promiscuity: What can we learn from current data? Drug Discov. Today 18, 644–650 (2013).
  • Baell, J. & Walters, M. A. Chemistry: Chemical con artists foil drug discovery. Nature 513, 481–483 (2014).

More about ‘Chemical Probes’ or ‘Tool Compounds’, compounds which are highly selective towards a specific target? Read these publications!

  • Arrowsmith, C. H. et al. The promise and peril of chemical probes. Nat. Chem. Biol. 11, 536–541 (2015).
  • Wang, Y. et al. Evidence-Based and Quantitative Prioritization of Tool Compounds in Phenotypic Drug Discovery. Cell Chem. Biol. 23, 862–874 (2016).
  • Butler, K. V., Macdonald, I. A., Hathaway, N. A. & Jin, J. Report and Application of a Tool Compound Data Set. J. Chem. Inf. Model. 57, 2699–2706 (2017).

Machine Learning to determine compound selectivity? Read these publications!

  • Sorgenfrei, F. A., Fulle, S. & Merget, B. Kinome-Wide Profiling Prediction of Small Molecules. ChemMedChem 1–6 (2017).
  • Giblin, K., Hughes, S., Boyd, H., Hansson, P. & Bender, A. Prospectively Validated Proteochemometric Models for the Prediction of Small Molecule Binding to Bromodomain Proteins. J. Chem. Inf. Model. XXXX, XXX, XXX-XXX acs.jcim.8b00400 (2018).
  • Subramanian, V., Prusis, P., Pietilä, L. O., Xhaard, H. & Wohlfahrt, G. Visually interpretable models of kinase selectivity related features derived from field-based proteochemometrics. J. Chem. Inf. Model. 53, 3021–3030 (2013).

Interested in more machine learning as an extension to the publications above? These publications might be a good starting point…

  • Cortés-Ciriano, I. et al. Polypharmacology modelling using proteochemometrics (PCM): recent methodological developments, applications to target families, and future prospects. Med. Chem. Commun. 6, 24–50 (2015).
  • Krstajic, D., Buturovic, L. J., Leahy, D. E. & Thomas, S. Cross-validation pitfalls when selecting and assessing regression and classification models. J. Cheminform. 6, 1–15 (2014).
  • Sun, J. et al. Applying Mondrian Cross-Conformal Prediction to Estimate Prediction Confidence on Large Imbalanced Bioactivity Data Sets. J. Chem. Inf. Model. 57, 1591–1598 (2017).


OPunting 2018

Hi everyone!

Today is the day to present to you my belated blogpost on OPIG punting (or OPunting for short). I promise I was not procrastinating on writing it. I am currently not in Oxford, as I am visiting beautiful Zurich as proved by the photo below.

Anyway, this blogpost is not about how much I enjoy Switzerland and Spezial Bratwursts.

OPunting took place on sunny/cloudy August 31st. About 20 ±1 OPIGlets were allocated into 5 punts, which were full of healthy snacks and drinks. We sailed off from the Cherwell Boathouse with the usual plan to punt to the Victoria Arms, where people could stretch their legs and have a pint (or two) of an ice cold beverage of their choice.

And so we were off:

If you did not know, Javi is always there to have your back if needed. As you can see below, Lucian seemed to run out of energy, but Javi was already there to replenish Lucian’s energy level. This shows the true spirit of OPIG.

We were glad to have Eoin as the captain of our punt. As one of the most experienced punters in OPIG, he knew that staying close to other OPIG punts would greatly slow them down and they would never make it to the Victoria Arms. And Eoin was right, as we arrived at our destination much faster than the others. Cheers!

Whilst we were finishing our beverages, others still struggled to get to the pub.

Things got quite competitive on the way back to the Cherwell Boathouse as Charlotte organized a race whereby five first year OPIGlets were selected as punters. Fergus I was a bit too eager to win the race and lost his balance, thus going for a short swim in the river Cherwell.

As there were no rules stated for this race, OPIGlets did their best to stop others from winning. The most common act of sabotage was stealing someone’s punting pole. Our punting pole was stolen by Charlotte’s punt, but it did not stop us from winning the race.

Whilst many OPIGlets were competing, others took it easy and enjoyed their day to the maximum.

After returning the punts to the Cherwell Boathouse, all of us (except for Matt) headed to a local pub, the name of which I cannot remember. On this note, I would like to conclude my blogpost on our eventful annual OPunting.



Mol2vec: Finding Chemical Meaning in 300 Dimensions

Embeddings of Amino Acids

2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities. Magnitudes reflect importance, i.e. more meaningful words. [Figure from Ref. 1]

Natural Language Processing (NLP) algorithms are usually used for analyzing human communication, often in the form of textual information such as scientific papers and Tweets. One aspect, coming up with a representation that clusters words with similar meanings, has been achieved very successfully with the word2vec approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences — the so-called corpus — to generate “embeddings” of the constituent words into a high-dimensional space. By computing the vector from “woman” to “queen”, and adding it to the position of “man” in this high-dimensional space, the answer, “king”, can be found.
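The analogy arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the three-dimensional vectors below are made up by hand so that the arithmetic works out, whereas real word2vec embeddings are learned from a corpus and have hundreds of dimensions. The function names (`cosine`, `analogy`) are hypothetical and not part of any library.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real word2vec vectors are learned from a large corpus.
EMBEDDINGS = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(start, end, query, embeddings):
    """Add the vector from `start` to `end` onto `query` and return the nearest other word."""
    offset = [e - s for s, e in zip(embeddings[start], embeddings[end])]
    target = [q + o for q, o in zip(embeddings[query], offset)]
    # Exclude the three input words so they cannot be returned as the answer.
    candidates = [w for w in embeddings if w not in (start, end, query)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Vector from "woman" to "queen", added to the position of "man".
result = analogy("woman", "queen", "man", EMBEDDINGS)
print(result)
```

Nearest neighbours are picked by cosine similarity here, which is a common choice for word embeddings; other distance measures would work for this toy example as well.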

A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec” [1]. They also released a Python implementation, available on Samo Turk’s GitHub repository.
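As the figure caption notes, a compound's vector is obtained by summing the vectors of the Morgan substructures it contains. A minimal sketch of that summation step is shown here. It does not use the mol2vec package itself: the substructure identifiers, the four-dimensional vectors and the function `molecule_vector` are all invented for illustration, and skipping unknown substructures is a simplification made for this sketch.

```python
# Hypothetical substructure identifiers with made-up 4-dimensional vectors.
# In practice these would be learned embeddings of Morgan substructures.
SUBSTRUCTURE_VECTORS = {
    "sub_a": [0.5, 0.0, 0.25, 0.0],
    "sub_b": [0.0, 1.0, 0.0, 0.5],
    "sub_c": [0.25, 0.25, 0.0, 0.0],
}


def molecule_vector(substructures, vectors, dim=4):
    """Sum the vectors of all substructures found in one molecule."""
    total = [0.0] * dim
    for sub in substructures:
        vec = vectors.get(sub)
        if vec is None:
            # Simplification for this sketch: ignore substructures without a vector.
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


# A substructure that occurs twice contributes twice to the sum.
example = molecule_vector(["sub_a", "sub_b", "sub_a"], SUBSTRUCTURE_VECTORS)
print(example)
```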



Cinder: Crystallographic Tinder

Protein structure determination is still dominated by X-ray diffraction. For diffraction studies, structural biologists need to grow and optimise protein crystals until they diffract to a usable and optimal resolution. A purified protein sample is exposed to a number of crystallisation screens, each comprising a selection of chemical conditions that are designed to explore a reasonably wide area of potential crystallisation conditions.

Many crystallography labs routinely image these in large plate storage systems, which reduces the human interaction to viewing a set of usually 100-1000 images at various time points. This is a slow and laborious process, and one well suited to machine learning approaches tailored to images. TexRank, a texton analysis ranking software, was developed by Jia Tsing in OPIG and is used at the Structural Genomics Consortium (SGC). Its ranking reduces the number of images that a human needs to search through, providing a quicker review process.

However, the ultimate aim is to further reduce or remove the human review step. The first step is to classify images, with the most important classification being whether a crystal is present. MARCO (Bruno et al., 2018) uses annotated images to classify images into four categories:

  • Crystals: 91% predicted
  • Precipitate: 96.1% predicted
  • Clear: 97.9% predicted
  • Other: 69.6% predicted

This is typically better than human classification: when a human classifies two image sets at the beginning and end of ~1000 crystal images, they are around 85% accurate (Snell et al., 2008).
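To make the per-category figures concrete, here is a small sketch of how such numbers can be computed from a set of annotated images. It assumes that "predicted" means the fraction of images of a given true category that the classifier assigned to that same category; the labels below are toy data, not MARCO results, and `per_class_accuracy` is a hypothetical helper.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of images in each true category that were predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Toy annotations and predictions, invented for illustration.
true_labels = ["Crystals"] * 4 + ["Precipitate"] * 2 + ["Clear"] * 2 + ["Other"] * 2
predicted_labels = [
    "Crystals", "Crystals", "Crystals", "Precipitate",  # one crystal missed
    "Precipitate", "Precipitate",
    "Clear", "Clear",
    "Other", "Clear",                                   # one "Other" mislabelled
]

scores = per_class_accuracy(true_labels, predicted_labels)
for label, score in scores.items():
    print(f"{label}: {score:.1%} predicted")
```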

Cinder (Crystallographic Tinder) is an app (Android & iOS) that collects human categorisations of crystals to produce a labelled set that can be used for further machine learning approaches to categorising images. A user can swipe to classify a crystal into one of four categories. A learning mode (KInder) is supplied to teach new crystallographers how to classify a variety of image types. The app can also be used to score a user’s own plates (C3 facility users).
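One way that many individual swipes could be turned into a labelled set is a simple majority vote per image. The sketch below is not how Cinder itself works internally, which the post does not describe. It assumes the four swipe categories match the MARCO categories listed above, and the function name, the `min_votes` threshold and the tie handling are all choices made for this illustration.

```python
from collections import Counter, defaultdict

# Assumption: the four swipe categories are the same as the MARCO categories.
CATEGORIES = ("Crystals", "Precipitate", "Clear", "Other")


def aggregate_swipes(swipes, min_votes=2):
    """Majority-vote label per image; images with too few votes or a tie are left out."""
    votes = defaultdict(Counter)
    for image_id, category in swipes:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        votes[image_id][category] += 1

    labelled = {}
    for image_id, counts in votes.items():
        ranked = counts.most_common(2)
        top_category, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        if sum(counts.values()) >= min_votes and not tied:
            labelled[image_id] = top_category
    return labelled


# Toy swipes as (image id, category) pairs, invented for illustration.
swipes = [
    ("img1", "Crystals"), ("img1", "Crystals"), ("img1", "Clear"),
    ("img2", "Precipitate"),              # only one vote, so not labelled
    ("img3", "Clear"), ("img3", "Other"),  # tie, so not labelled
]
labels = aggregate_swipes(swipes)
print(labels)
```

Leaving out tied or sparsely voted images keeps the labelled set smaller but cleaner, which tends to matter when the labels are later used to train a classifier.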


Although identifying a crystal/precipitate in a drop is essential, reducing the human interaction will require further classification efforts. For example, a crystal screening drop may contain precipitant, crystal and micro crystals. Identifying these features hierarchically will be needed to further study whether that condition could be considered viable. Furthermore, following the potential crystallinity of a drop over time is important, to determine whether a condition can be optimised to produce higher quality crystals. Classifying crystallisation outcomes would ideally be used to predict the conditions in which a protein may crystallise; however, this is far from reality in the crystallisation community.


  • Ng, Jia Tsing et al. “Using Textons to Rank Crystallization Droplets by the Likely Presence of Crystals.” Acta Crystallographica Section D: Biological Crystallography 70.Pt 10 (2014): 2702–2718. PMC. Web. 28 Aug. 2018.
  • Snell, Edward H. et al. “Establishing a Training Set through the Visual Analysis of Crystallization Trials. Part I: ∼150 000 Images.” Acta Crystallographica Section D: Biological Crystallography 64.Pt 11 (2008): 1123–1130. PMC. Web. 28 Aug. 2018.
  • Bruno AE, Charbonneau P, Newman J, Snell EH, So DR, et al. (2018) Classification of crystallization outcomes using deep convolutional neural networks. PLOS ONE 13(6): e0198883.
  • Rosa, N., Ristic, M., Marshall, B. & Newman, J. (2018). Acta Cryst. F74, 410-418.

Rasmus Fonseca and GetContacts

We welcomed Rasmus Fonseca to last week’s OPIG Group Meeting. Rasmus is currently a Visiting Scholar at Stanford. He gave a fascinating talk about the interaction analysis of molecular structures and ensembles using the GetContacts package, one of many projects that he has contributed to that you can find on his GitHub repo.

Rasmus was kind enough to share his slides with us: