Slowing the progress of prion diseases

At present, the jury is still out on how prion diseases affect the body, let alone how to cure them. We don’t know if amyloid plaques cause neurodegeneration or if they’re the result of it. Due to highly variable glycosylphosphatidylinositol (GPI) anchors, we don’t know the structure of prions. Due to their incredible resistance to proteolysis, we don’t know a simple way to destroy prions, even using an autoclave. The current recommendation[0] by the World Health Organisation includes the not-so-subtle: “Immerse in a pan containing 1N sodium hydroxide and heat in a gravity displacement autoclave at 121°C”.

There are several species, including water buffalo, horses and dogs, which are immune to prion diseases. Until relatively recently it was thought that rabbits were immune too. “Despite rabbits no longer being able to be classified as resistant to TSEs, an outbreak of ‘mad rabbit disease’ is unlikely”.[1] That being said, beyond noting some additional salt bridges and H-bonds in their prion proteins, we don’t know why some animals are immune, or whether those features are the reason.

We do know that at least two species of lichen (P. sulcata and L. pulmonaria) have not only discovered a way to naturally break down prions, but have evolved two completely independent pathways to do so.[2] How do they accomplish this? We’re still not sure. In fact, it was only last year that it was discovered that lichens may be composed of three symbiotic partners and not two, as previously thought.[3]

With all this uncertainty, one thing is known: PrPSc, the pathogenic form of the prion, converts PrPC, the cellular form, into more PrPSc. Just preventing the production of PrPC may not be a good idea, mainly because we don’t know what it’s there for in the first place. Previous PrP-knockout studies have hinted that:

Hematopoietic stem cells express PrP on their cell membrane. PrP-null stem cells exhibit increased sensitivity to cell depletion. [4]
In mice, cleavage of PrP proteins in peripheral nerves causes the activation of myelin repair in Schwann Cells. Lack of PrP proteins caused demyelination in those cells. [5]
Mice lacking genes for PrP show altered long-term potentiation in the hippocampus. [6]
Prion proteins have been implicated in cell-cell adhesion and intracellular signalling.[7]

However, there is an alternative approach which bypasses most of the unknowns above: if it were possible to make off with the substrate which PrPSc uses, the progress of the disease might be slowed. A study by R. Diaz-Espinoza et al. showed that infecting animals with a self-replicating, non-pathogenic prion slowed the progress of the fatal 263K scrapie agent. From their paper[8]: “results show that a prophylactic inoculation of prion-infected animals with an anti-prion delays the onset of the disease and in some animals completely prevents the development of clinical symptoms and brain damage.”

[0] https://www.cdc.gov/prions/cjd/infection-control.html
[1] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3323982/
[2] https://blogs.scientificamerican.com/artful-amoeba/httpblogsscientificamericancomartful-amoeba20110725lichens-vs-the-almighty-prion/
[3] http://science.sciencemag.org/content/353/6298/488
[4] “Prion protein is expressed on long-term repopulating hematopoietic stem cells and is important for their self-renewal”. PNAS. 103 (7): 2184–9. doi:10.1073/pnas.0510577103
[5] Abbott A (2010-01-24). “Healthy prions protect nerves”. Nature. doi:10.1038/news.2010.29
[6] Maglio LE, Perez MF, Martins VR, Brentani RR, Ramirez OA (Nov 2004). “Hippocampal synaptic plasticity in mice devoid of cellular prion protein”. Brain Research. Molecular Brain Research. 131 (1-2): 58–64. doi:10.1016/j.molbrainres.2004.08.004
[7] Málaga-Trillo E, Solis GP, et al. (Mar 2009). Weissmann C, ed. “Regulation of embryonic cell adhesion by the prion protein”. PLoS Biology. 7 (3): e55. doi:10.1371/journal.pbio.1000055
[8] http://www.nature.com/mp/journal/vaop/ncurrent/full/mp201784a.html

Journal Club: Statistical database analysis of the role of loop dynamics for protein-protein complex formation and allostery

As I’ve mentioned on this blog a few (ok, more than a few) times before, loops are often very important regions of a protein, allowing it to carry out its function effectively. In my own research, I develop methods for loop structure prediction (in particular for antibody CDR H3), and look at loop conformational changes and flexibility. So, when I came across a paper that has the words ‘loops’, ‘flexibility’ and ‘antibody’ in its abstract, it was the obvious choice to present at my most recent journal club!

In the paper, entitled “Statistical database analysis of the role of loop dynamics for protein-protein complex formation and allostery”, the authors focus on how loop dynamics change upon the formation of protein-protein complexes. To do this, they use an algorithm they previously published called ToeLoop – given a protein structure, this classifies the loop regions as static, slow, or fast, based on both sequential and structural features:

relative amino acid frequencies;
the frequency of loop secondary structure types as annotated by DSSP (bends, β-bridges etc.);
the average solvent accessible surface area;
the average hydrophobicity index for the loop residues;
loop length;
contacts between atoms of the loop and the rest of the protein.

Two scores are calculated using the properties listed above: one that distinguishes ‘static’ loops from ‘mobile’ loops (with a reported 81% accuracy), and another that further categorises the mobile loops into ‘slow’ and ‘fast’ (74% accuracy). Results from the original ToeLoop paper indicate that fast loops are shorter, have more negatively charged residues, larger solvent accessibilities, lower hydrophobicity, and fewer contacts.

Gu et al. use ToeLoop to investigate the dynamic behaviour of loops during protein-protein complex formation. For a set of 230 protein complexes, they classified the loops of the proteins in both their free and complexed forms (illustrated by the figure below).

The loops from 230 protein complexes, in both free and bound forms, were categorised as fast, slow, or static using the ToeLoop algorithm. The loops are coloured according to their predicted dynamics. Allosteric loops, defined as those whose mobility increases upon binding, are indicated using blue arrows.

In the uncomplexed form, the majority of loops were annotated as static (63.6%), followed by slow (26.2%) and finally fast (10.2%). This indicates that most loops are inflexible. After complex formation, the number of static loops increases and the number of mobile loops decreases (67.8%, 23.0%, and 9.2% for static, slow and fast respectively). Mobility, on the whole, is therefore reduced upon binding, which is as expected – the presence of a binding partner restricts the range of possible movement.

The authors then divided the loops into two groups, interface and non-interface, according to the average minimum distance of each loop residue to the binding partner (cutoff values from 4 to 8 Å were tested and each gave broadly similar results). The dynamics of non-interface loops changed less upon binding than those of the interface loops (again, this was as expected). However, an interesting result is that slow loops are more common at the interface than in other parts of the protein, with 37.2% of interface loops being annotated as slow compared to 24.8% of non-interface loops. The authors suggest that this is due to protein promiscuity, i.e. slow loops allow proteins to bind to different partners.

The 4600 loops analysed in the study were split into two groups based on their proximity to the interface. As expected, interface loops are affected more by binding than non-interface loops. Slow loops are more prevalent at the interface than elsewhere on the protein.
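
As a rough illustration of the interface/non-interface split, here is a minimal sketch that takes the minimum distance from each loop residue to any atom of the binding partner, averages those minima over the loop, and compares the result with a cutoff. The coordinates, the function name `is_interface_loop` and the 6 Å default are all made up for this example; the paper's exact procedure may well differ.

```python
import numpy as np

def is_interface_loop(loop_residues, partner_atoms, cutoff=6.0):
    """Return (flag, average minimum distance) for one loop.

    loop_residues: list of (n_atoms, 3) arrays, one per loop residue.
    partner_atoms: (m, 3) array of binding-partner atom coordinates.
    cutoff: distance threshold in angstroms (hypothetical default).
    """
    min_dists = []
    for res in loop_residues:
        # All pairwise distances between this residue's atoms and the partner
        diffs = res[:, None, :] - partner_atoms[None, :, :]
        dists = np.linalg.norm(diffs, axis=-1)
        min_dists.append(dists.min())
    avg_min = float(np.mean(min_dists))
    return avg_min <= cutoff, avg_min

# Toy coordinates: a partner near the origin, one close loop and one distant loop
partner = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
near_loop = [np.array([[0.0, 3.0, 0.0]]), np.array([[1.5, 5.0, 0.0]])]
far_loop = [np.array([[0.0, 30.0, 0.0]]), np.array([[1.5, 32.0, 0.0]])]

near_flag, near_avg = is_interface_loop(near_loop, partner)
far_flag, far_avg = is_interface_loop(far_loop, partner)
print(near_flag, near_avg)
print(far_flag, far_avg)
```

Re-running with cutoffs between 4 and 8 Å is then just a matter of changing the `cutoff` argument.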

Binding-induced dynamic changes were then investigated in more detail, by dividing the loops into 9 categories based on the transition (i.e. static-static, slow-static, slow-fast etc.). The dynamic behaviour of most loops (4120 out of 4600) does not change, and those loops whose mobility decreased upon binding were found close to the interface (average distance of ~12 Å). A small subset of the loops (termed allosteric by the authors) demonstrated an increase in flexibility upon complex formation (142 out of 4600); these tended to be located further away from the interface (average distance of ~30 Å).
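
To make the bookkeeping concrete, here is a small sketch of how the free/bound labels could be tallied into the nine transition categories and then collapsed into "unchanged", "decreased" and "increased" mobility. The loop labels are invented and `classify_transition` is a hypothetical helper, not part of ToeLoop. The last lines just do the arithmetic on the counts quoted above, assuming the three groups are exhaustive.

```python
from collections import Counter

LEVELS = {"static": 0, "slow": 1, "fast": 2}

def classify_transition(free, bound):
    """Label a loop by how its predicted mobility changes on binding."""
    if LEVELS[bound] > LEVELS[free]:
        return "increased"   # the 'allosteric' case in the paper's terminology
    if LEVELS[bound] < LEVELS[free]:
        return "decreased"
    return "unchanged"

# Hypothetical (free, bound) labels for a handful of loops
loops = [
    ("static", "static"),
    ("slow", "static"),
    ("slow", "slow"),
    ("static", "slow"),
    ("fast", "slow"),
    ("fast", "fast"),
]

transitions = Counter(loops)  # up to 9 distinct (free, bound) categories
changes = Counter(classify_transition(f, b) for f, b in loops)
print(transitions)
print(changes)

# Counts quoted in the text: everything that is neither unchanged nor more
# mobile must have become less mobile (if the three groups are exhaustive)
total, unchanged, increased = 4600, 4120, 142
decreased = total - unchanged - increased
print(decreased)
```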

One of these allosteric loops was investigated further as part of a case study. The complex in question was an antibody-antigen complex, in which one loop distant from the binding site transitioned from static to slow upon binding. The loops directly involved in binding (the CDRs) either displayed reduced flexibility or remained static. The presence of an allosteric loop was supported by experimental data – the loop is shown to change conformation upon binding (RMSD of 3.6 Å between bound and unbound crystal structures from the PDB), and the average B-factor for the loop atoms increased on complex formation from around 26 Å² to approximately 140 Å². The authors also carried out MD simulations of the unbound antibody and antigen as well as the complex, and showed that the loop moved more in the complex than in the free antibody. The authors propose that the increased flexibility of the loop offsets the entropy loss that occurs due to binding, thereby increasing the strength of binding. ToeLoop could, therefore, be a useful tool in the development of antibody therapies (or other protein drugs) – it could be used in tandem with an antibody modelling protocol, allowing the dynamic behaviour of loop regions to be monitored and possibly designed to increase affinities.

Finally, the authors explored the link between loop dynamics and binding affinity. Again, they used ToeLoop to predict the flexibility of loops, but this time the complexes were from a set of 170 with known affinity. They demonstrated that affinity is correlated with the number of static loop residues present at the interface – ‘strong’ binders (those with picomolar affinity) tend to contain more static residues than more weakly binding pairs of proteins. This is in accordance with the theory that the rigidification of flexible loops upon binding leads to lower affinities, due to the loss of entropy.

Typography in graphs.

Typography [tʌɪˈpɒɡrəfi]
n.: the style and appearance of printed matter.

Perhaps an often glossed-over feature of making graphs, having the right font goes a long way. Not only do we get to use a “pretty” font that we like, but we also get the aesthetic satisfaction of having everything (e.g. in a PhD thesis) in the same font, i.e. both the text and the graphs.

Fonts can be divided into two types: serif and sans-serif. Basically, serif fonts are those where the letters have little “bits” at the end; think of Times New Roman or Garamond as the classic examples. Sans-serif fonts are those that lack these bits, giving them a more “blocky”, clean finish – think of Arial or Helvetica as classic examples.

Typically, serif fonts are better for books/printed materials, whereas sans-serif fonts are better for web/digital content. So what about graphs, especially those that may go out into the public domain (whether through publication or on a web site)?

This largely boils down to user preference, and choosing the right font is not trivial. Supposing that you have a font you like (say, from Google Fonts), there are a few things you need to do (e.g. make sure that your TeX distribution and Illustrator have the font). However, this post is concerned with how we can use custom fonts in a graph generated by Matplotlib, and why this is useful. My favourite picks for fonts include Roboto and Palatino.

The default font in matplotlib isn’t the prettiest (I think) for publication/keeping purposes, but I digress…

To start, let’s generate a histogram of 1000 random numbers from a normal distribution.
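
Something along these lines would produce that histogram with the default font. The seed, bin count, labels and output filename are arbitrary choices of mine, and the `Agg` backend is only there so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
values = rng.normal(loc=0.0, scale=1.0, size=1000)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=30)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("1000 draws from a standard normal distribution")
fig.savefig("histogram_default_font.png", dpi=150)
```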

The default font in matplotlib, Bitstream Vera Sans (DejaVu Sans from matplotlib 2.0 onwards), isn’t the prettiest thing on earth. It does the job, but it isn’t my go-to choice if I can change it. Plus, with lots of journals asking for Type 1/TrueType fonts for images, there’s even more reason to change this anyway (matplotlib, by default, generates graphs using Type 3 fonts!). If we now change to Roboto or Palatino, we get the following:

Sans-serif Roboto.

Serif font Palatino.

Basically, the bits we need to include at the beginning of our code are here:

# Import matplotlib's options-setting helpers
from matplotlib import rc, rcParams

# Set PDF font type - not necessary but useful for publications
# (42 embeds TrueType fonts instead of the default Type 3)
rcParams['pdf.fonttype'] = 42

# For sans-serif - the font must be installed where matplotlib can find it
rc("font", **{"family": "sans-serif", "sans-serif": ["Roboto"]})

# For serif - matplotlib uses sans-serif family fonts by default
# To render Palatino this way, we also tell matplotlib to use LaTeX for text
# (this needs a working TeX installation)
rc("font", **{"family": "serif", "serif": ["Palatino"]})
rc("text", usetex=True)

This not only guarantees that images are generated using a font of our choice, but it gives a Type 1/TrueType font too. Ace!

Happy plotting.

Using Random Forests in Python with Scikit-Learn

I spend a lot of time experimenting with machine learning tools in my research; in particular I seem to spend a lot of time chasing data into random forests and watching the other side to see what comes out. In my many hours of Googling “random forest foobar” a disproportionate number of hits offer solutions implemented in R. As a young Pythonista in the present year I find this a thoroughly unacceptable state of affairs, so I decided to write a crash course in how to build random forest models in Python using the machine learning library scikit-learn (or sklearn to friends). This is far from exhaustive, and I won’t be delving into the machinery of how and why we might want to use a random forest. Rather, the hope is that this will be useful to anyone looking for a hands-on introduction to random forests (or machine learning in general) in Python.

In the future I’ll write a more in-depth post on how a few libraries turn Python into a powerful environment for data handling and machine learning. Until then, though, let’s jump into random forests!

Toy datasets

Sklearn comes with several nicely formatted real-world toy data sets which we can use to experiment with the tools at our disposal. We’ll be using the venerable iris dataset for classification and the Boston housing set for regression. Sklearn also provides tools for generating synthetic data, all of which are well-documented. Now, let’s write some Python!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
iris = datasets.load_iris()

Classification using random forests

First we’ll look at how to solve a simple classification problem using a random forest. The iris dataset is probably the most widely-used example for this problem and nicely illustrates the problem of classification when some classes are not linearly separable from the others.

First we’ll load the iris dataset into a pandas dataframe. Pandas is a nifty Python library which provides a data structure comparable to the dataframes found in R with database style querying. As an added bonus, the seaborn visualization library integrates nicely with pandas allowing us to generate a nice scatter matrix of our data with minimal fuss.

df = pd.DataFrame(iris.data, columns=iris.feature_names)

# sklearn provides the iris species as integer values since this is required for classification
# here we're just adding a column with the species names to the dataframe for visualisation
df['species'] = np.array([iris.target_names[i] for i in iris.target])

sns.pairplot(df, hue='species')

Neat. Notice that iris-setosa is easily identifiable by petal length and petal width, while the other two species are much more difficult to distinguish. We could do all sorts of pre-processing and exploratory analysis at this stage, but since this is such a simple dataset let’s just fire on. We’ll do a bit of pre-processing later when we come to the Boston data set.

First, let’s split the data into training and test sets. We’ll use stratified sampling by iris class to ensure both the training and test sets contain a balanced number of representatives of each of the three classes. Sklearn requires that all features and targets be numeric, so the three classes are represented as integers (0, 1, 2). Here we’re doing a simple 50/50 split because the data are so nicely behaved. Typically, however, we might use a 75/25 or even 80/20 training/test split to ensure we have enough training data. In true Python style this is a one-liner.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)

Now let’s fit a random forest classifier to our training set. For the most part we’ll use the default settings since they’re quite robust. One exception is the out-of-bag estimate: by default an out-of-bag error estimate is not computed, so we need to tell the classifier object that we want this.

If you’re used to the R implementation, or you ever find yourself having to compare results using the two, be aware that some parameter names and default settings are different between the two. Fortunately both have excellent documentation so it’s easy to ensure you’re using the right parameters if you ever need to compare models.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)

Let’s see how well our model performs when classifying our unseen test data. For a random forest classifier, the out-of-bag score computed by sklearn is an estimate of the classification accuracy we might expect to observe on new data. We’ll compare this to the actual score obtained on our test data.

from sklearn.metrics import accuracy_score

predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)

print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')

Out-of-bag score estimate: 0.973
Mean accuracy score: 0.933
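
If you’re wondering where the out-of-bag samples come from: each tree is trained on a bootstrap sample (drawn with replacement) of the training set, so some training examples are never seen by that tree and can be used to score it. The sketch below only simulates the bootstrap step, using the standard library rather than sklearn, to show that roughly a third of the data ends up out-of-bag for any given tree. The seed is arbitrary; 75 is the size of our 50% training split and 100 matches n_estimators.

```python
import random

random.seed(0)    # arbitrary seed
n_samples = 75    # size of the 50% iris training split used above
n_trees = 100

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap: draw n_samples indices with replacement
    in_bag = {random.randrange(n_samples) for _ in range(n_samples)}
    oob_fractions.append(1 - len(in_bag) / n_samples)

mean_oob = sum(oob_fractions) / n_trees
expected = (1 - 1 / n_samples) ** n_samples  # tends to 1/e as n_samples grows

print(f"Mean out-of-bag fraction: {mean_oob:.3f}")
print(f"Expected (1 - 1/n)^n:     {expected:.3f}")
```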

Not bad. However, this doesn’t really tell us anything about where we’re doing well. A useful technique for visualising performance is the confusion matrix. This is simply a matrix whose diagonal values are true positive counts, while off-diagonal values are false positive and false negative counts for each class against the other.

from sklearn.metrics import confusion_matrix

cm = pd.DataFrame(confusion_matrix(y_test, predicted), columns=iris.target_names, index=iris.target_names)
sns.heatmap(cm, annot=True)

This lets us know that our model correctly separates the setosa examples, but exhibits a small amount of confusion when attempting to distinguish between versicolor and virginica.
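
To see what the confusion matrix is actually counting, here is a hand-rolled version using plain numpy. It follows the same convention as sklearn (rows are true classes, columns are predicted classes), but the labels are made up and are not the iris results above.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up labels for three classes (0, 1, 2), not the iris results
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 2, 1, 2]

toy_cm = confusion(y_true, y_pred, 3)
print(toy_cm)

# Accuracy is the sum of the diagonal divided by the total count
toy_accuracy = np.trace(toy_cm) / toy_cm.sum()
print(toy_accuracy)
```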

Random forest regression

Now let’s look at using a random forest to solve a regression problem. The Boston housing data set consists of census housing price data in the region of Boston, Massachusetts, together with a series of values quantifying various properties of the local area such as crime rate, air pollution, and student-teacher ratio in schools. The question for us is whether we can use these data to accurately predict median house prices. One caveat of this data set is that the median house price is truncated at $50,000 which suggests that there may be considerable noise in this region of the data. You might want to remove all data with a median house price of $50,000 from the set and see if the regression improves at all.

As before we’ll load the data into a pandas dataframe. This time, however, we’re going to do some pre-processing of our data by independently transforming each feature to have zero mean and unit variance. The values of different features vary greatly in order of magnitude. If we were to analyse the raw data as-is, we run the risk of our analysis being skewed by certain features dominating the variance. This isn’t strictly necessary for a random forest, but will enable us to perform a more meaningful principal component analysis later. Performing this transformation in sklearn is super simple using the StandardScaler class of the preprocessing module. This time we’re going to use an 80/20 split of our data. You could bin the house prices to perform stratified sampling, but we won’t worry about that for now.

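# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call needs an older release
boston = datasets.load_boston()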

features = pd.DataFrame(boston.data, columns=boston.feature_names)
targets = boston.target

As before, we’ve loaded our data into a pandas dataframe. Notice how I have to construct new dataframes from the transformed data. This is because sklearn is built around numpy arrays. While it’s possible to return a view of a dataframe as an array, transforming the contents of a dataframe requires a little more work. Of course, there’s a library for that, but I’m lazy so I didn’t use it this time.

from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), index=X_train.index.values, columns=X_train.columns.values)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), index=X_test.index.values, columns=X_test.columns.values)
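
For the curious, this is roughly what the scaler is doing under the hood, written with plain numpy on made-up numbers. The important detail is that the mean and standard deviation come from the training data only and are then reused for the test data.

```python
import numpy as np

# Made-up data: two features on very different scales
train = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
test = np.array([[2.0, 2000.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics, never refit on test data

print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
print(test_scaled)
```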

With the data standardised, let’s do a quick principal-component analysis to see if we could reduce the dimensionality of the problem. This is quick and easy in sklearn using the PCA class of the decomposition module.

from sklearn.decomposition import PCA

# PCA on the raw (unscaled) training features
pca = PCA()
pca.fit(X_train)
cpts = pd.DataFrame(pca.transform(X_train))
x_axis = np.arange(1, pca.n_components_+1)

# PCA on the standardised training features
pca_scaled = PCA()
pca_scaled.fit(X_train_scaled)
cpts_scaled = pd.DataFrame(pca_scaled.transform(X_train_scaled))

# matplotlib boilerplate goes here
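
In case it’s useful, here is one way the plotting boilerplate might look: a pair of explained-variance plots, one for the raw features and one for the standardised ones. To keep the snippet self-contained I’ve swapped in synthetic data (five independent features with wildly different scales) for the Boston training set, so the numbers won’t match the real thing. The variable names `pca`, `pca_scaled` and `x_axis` mirror the ones above; everything else is my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for X_train: 5 independent features with very different scales
X_train = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 1000.0])
X_train_scaled = StandardScaler().fit_transform(X_train)

pca = PCA().fit(X_train)
pca_scaled = PCA().fit(X_train_scaled)
x_axis = np.arange(1, pca.n_components_ + 1)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
panels = zip(axes, (pca, pca_scaled), ("Raw features", "Standardised features"))
for ax, model, title in panels:
    # Bars: variance explained by each component; line: cumulative total
    ax.bar(x_axis, model.explained_variance_ratio_)
    ax.plot(x_axis, np.cumsum(model.explained_variance_ratio_), marker="o", color="k")
    ax.set_xlabel("Principal component")
    ax.set_title(title)
axes[0].set_ylabel("Explained variance ratio")
fig.tight_layout()
fig.savefig("pca_explained_variance.png", dpi=150)
```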

Notice how without data standardisation the variance is completely dominated by the first principal component. With standardisation, however, we see that in fact we must consider multiple features in order to explain a significant proportion of the variance. You might want to experiment with building regression models using the principal components (or indeed just combinations of the raw features) to see how well you can do with less information. For now though we’re going to use all of the (scaled) features as the regressors for our model. As with the classification problem fitting the random forest is simple using the RandomForestRegressor class.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train_scaled, y_train)

Now let’s see how we do on our test set. As before we’ll compare the out-of-bag estimate (this time it’s an R-squared score) to the R-squared score for our predictions. We’ll also compute Spearman rank and Pearson correlation coefficients for our predictions to get a feel for how we’re doing.

from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr

predicted_train = rf.predict(X_train_scaled)
predicted_test = rf.predict(X_test_scaled)

test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)

print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print(f'Test data R-2 score: {test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')

Out-of-bag R-2 score estimate: 0.841
Test data R-2 score: 0.886
Test data Spearman correlation: 0.904
Test data Pearson correlation: 0.942

Not too bad, though there are a few outliers that would be worth looking into. Your challenge, should you choose to accept it, is to see if removing the $50,000 data improves the regression.
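
As a starting point for that challenge, filtering out the capped values is a one-line boolean mask. The arrays below are tiny hypothetical stand-ins for the `features` and `targets` objects used above; note that the Boston target is recorded in units of $1000, so the cap appears as 50.0. The same mask can be used to index a pandas dataframe.

```python
import numpy as np

# Hypothetical stand-ins for `features` and `targets`;
# the target is the median price in units of $1000
targets = np.array([24.0, 50.0, 21.6, 50.0, 34.7, 33.4])
features = np.arange(12, dtype=float).reshape(6, 2)

keep = targets < 50.0  # drop the rows sitting at the $50,000 cap
features_kept = features[keep]
targets_kept = targets[keep]

print(f"Removed {np.sum(~keep)} of {len(targets)} rows")
```

From there you would redo the train/test split and refit the regressor on the filtered data.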

Wrapping up

Congratulations on making it this far. Now you know how to pre-process your data and build random forest models all from the comfort of your iPython session. I plan on writing more in the future about how to use Python for machine learning, and in particular how to make use of some of the powerful tools available in sklearn (a pipeline for data preparation, model fitting, prediction, in one line of Python? Yes please!), and how to make sklearn and pandas play nicely with minimal hassle. If you’re lucky, and if I can bring myself to process the data nicely, I might include some fun examples from less well-behaved real-world data sets.

Until then, though, happy Pythoning!

Biological Space – a starting point in in-silico drug design and in experimentally exploring biological systems

What is the “biological space” and why is this space so important for all researchers interested in developing novel drugs? In the following, I will first establish a definition of the biological space and then highlight its use in computationally developing novel drug compounds and as a starting point in the experimental exploration of biological systems.

While chemical space has been defined as the entirety of all possible chemical compounds which could ever exist, the definition of biological space is less clear. In the following, I define biological space as the area(s) of chemical space that possess biologically active (“bioactive”) compounds for a specific target or target class¹. As such, these compounds can modulate a given biological system and subsequently influence disease development and progression. In the literature, this space has also been called “biologically relevant chemical space”².

Only a small percentage of the vast chemical space has been estimated to be biologically active and thus relevant for drug development, so randomly searching for bioactive compounds in chemical space with no prior information resembles the search for “the needle in a haystack”. Hence, it should come as no surprise that bioactive molecules are often used as a starting point in in-silico explorations of biological space.
The plethora of in-silico methods for this task includes similarity and pharmacophore searching methods³⁻⁶ for novel compounds, scaffold-hopping approaches to derive novel chemotypes⁷⁻⁸ and the development of quantitative structure-activity relationships (QSAR)⁹⁻¹⁰ to explore the interplay between the 3D chemical structure and its biological activity towards a specific target.
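
To give a flavour of what similarity searching involves, the sketch below ranks a tiny library against a known active using the Tanimoto coefficient, a measure commonly used for comparing molecular fingerprints. The fingerprints here are invented sets of "on" bit indices and the compound names are placeholders; real applications would derive fingerprints from chemical structures with a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

# Hypothetical fingerprints: each set holds the indices of the bits that are set
query = {1, 4, 7, 9, 12}
library = {
    "cmpd_A": {1, 4, 7, 9, 13},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 7, 9, 12, 15},
}

# Rank library compounds from most to least similar to the query
ranked = sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)
for name in ranked:
    print(name, round(tanimoto(query, library[name]), 3))
```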

The biological space is comprised of small molecules which are active on specific targets. If researchers want to explore the role of targets in a given biological system experimentally, they can use small molecules which are potent and selective towards a specific target (and thus confined to a particular area in chemical space)¹¹⁻¹².
Due to their high selectivity (e.g. a greater than 30-fold selectivity over other proteins of the same family¹²), these so-called “tool compounds” can help establish the biological tractability (the relationship between the target and a given phenotype) and the clinical tractability (the availability of biomarkers) of a target¹¹. They are thus highly complementary to methods such as RNAi, CRISPR¹² and knock-out animals¹¹. Consequently, tool compounds are used in drug target validation, and the information they provide on the biological system can increase the probability of a successful drug¹¹. Most importantly, tool compounds are particularly useful for annotating targets in currently unexplored biological systems and are thus important for novel drug development¹³.
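
As a small worked example of the selectivity criterion, fold selectivity is usually expressed as the ratio of a compound's potency against an off-target to its potency against the intended target. The sketch below uses IC50 values for this; the numbers and the function name `fold_selectivity` are hypothetical.

```python
def fold_selectivity(ic50_target_nm, ic50_off_target_nm):
    """Ratio of off-target to on-target IC50; larger means more selective."""
    return ic50_off_target_nm / ic50_target_nm

# Hypothetical potencies in nM for one compound against its target and a family member
ic50_target = 10.0
ic50_family_member = 450.0

ratio = fold_selectivity(ic50_target, ic50_family_member)
print(f"{ratio:.0f}-fold selective")
print("Meets a >30-fold criterion:", ratio > 30)
```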

1. Sophie Petit-Zeman, http://www.nature.com/horizon/chemicalspace/background/figs/explore_b1.html, accessed on 03.07.2016.
2. Koch, M. A. et al. Charting biologically relevant chemical space: a structural classification of natural products (SCONP). Proceedings of the National Academy of Sciences of the United States of America 102, 17272–17277 (2005).
3. Stumpfe, D. & Bajorath, J. Similarity searching. Wiley Interdisciplinary Reviews: Computational Molecular Science 1, 260–282 (2011).
4. Bender, A. et al. How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space. Journal of Chemical Information and Modeling 49, 108–119 (2009).
5. Ai, G. et al. A combination of 2D similarity search, pharmacophore, and molecular docking techniques for the identification of vascular endothelial growth factor receptor-2 inhibitors. Anti-Cancer Drugs 26, 399–409 (2015).
6. Willett, P., Barnard, J. M. & Downs, G. M. Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences 38, 983–996 (1998).
7. Sun, H., Tawa, G. & Wallqvist, A. Classification of scaffold-hopping approaches. Drug Discovery Today 17, 310–324 (2012).
8. Hu, Y., Stumpfe, D. & Bajorath, J. Recent Advances in Scaffold Hopping: Miniperspective. Journal of Medicinal Chemistry 60, 1238–1246 (2017).
9. Cruz-Monteagudo, M. et al. Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? Drug Discovery Today 19, 1069–1080 (2014).
10. Bradley, A. R., Wall, I. D., Green, D. V. S., Deane, C. M. & Marsden, B. D. OOMMPPAA: A Tool To Aid Directed Synthesis by the Combined Analysis of Activity and Structural Data. Journal of Chemical Information and Modeling 54, 2636–2646 (2014).
11. Garbaccio, R. & Parmee, E. The Impact of Chemical Probes in Drug Discovery: A Pharmaceutical Industry Perspective. Cell Chemical Biology 23, 10–17 (2016).
12. Arrowsmith, C. H. et al. The promise and peril of chemical probes. Nature Chemical Biology 11, 536–541 (2015).
13. Fedorov, O., Müller, S. & Knapp, S. The (un) targeted cancer kinome. Nature Chemical Biology 6, 166–169 (2010).

In MATLAB, it’s colormaps all the way down

My overriding emotion, working in R, has been incomprehension: incomprehension at the gallery of ugly gnomes that populate the namespace and worried puzzlement over the strange incantations required to get them to dance in a statistically harmonious way. But all that aside, I resolved, joining the group, to put aside my misgivings and give the gnomes another try.

Soon, I found myself engaged in a reassessment of my life choices. I realized that life’s too short to spend it tickling gnomes – especially when only one of them knows how to do linear regression, but he won’t tell you your p value unless you give him the right kinds of treats. I fired up MATLAB and I haven’t looked back.

However, there was one issue of continued perplexity, and I’m not referring to why MATLAB insists on shouting itself at you. I need to make a lot of 2-D plots of protein distance matrices. The trouble is that I like to highlight parts of them, and that’s not straightforward in MATLAB. Let’s have a look at an example:

>> dists=dlmread('1hel.distances');
>> colormap gray;
>> imagesc(dists>8);
>> axis square;

Now, let’s load up a set of residues and try to overlay them on top of the first image:

>> resn=dlmread('1hel.resn');
>> mask = zeros(size(dists));
>> mask(resn,resn)=1;
>> hold on
>> imagesc(1-mask, 'AlphaData',mask*.5);

So far, so easy. To review the main points:

mask is a matrix which has a one at all the pixels that we want to highlight. But we use imagesc(1-mask) because the gray colormap displays black at 0 and white at 1. If we did imagesc(mask), we would end up with grey everywhere and white only where we hoped to highlight – the opposite effect from the one that we sought.

AlphaData is a property which sets the transparency of the image. We want the image to be fully transparent where mask is 0 – so as not to fog out the underlying image – and partially transparent where mask is 1. 0.5*mask is a matrix which is 0.5 everywhere that mask is 1 and 0 everywhere else. If we set 0.5*mask as the AlphaData property, then the colour we add will be at half transparency and the white areas will be fully transparent.
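
If the transparency arithmetic is hard to picture, here is the same idea as a small NumPy sketch (Python rather than MATLAB, purely for illustration). It applies the standard alpha compositing formula to a 2×2 toy image, where an alpha of 0 leaves the base pixel untouched and an alpha of 1 would replace it entirely. The pixel values are invented.

```python
import numpy as np

base = np.array([[0.0, 1.0],
                 [1.0, 0.0]])  # underlying greyscale image (0 = black, 1 = white)
mask = np.array([[1.0, 0.0],
                 [1.0, 0.0]])  # 1 where we want a highlight

overlay = 1 - mask    # the overlay image, as in imagesc(1-mask)
alpha = 0.5 * mask    # per-pixel opacity, as in 'AlphaData',mask*.5

# Standard alpha compositing: blend the overlay onto the base pixel by pixel
blended = alpha * overlay + (1 - alpha) * base
print(blended)
```

Pixels outside the mask keep their original value, while white pixels inside the mask are pulled halfway towards the black overlay.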

But this isn’t a very pleasant image. We want to be able to highlight the regions in some colour other than grey. Let’s try.

>> close all
>> imagesc(dists>8)
>> colormap gray
>> axis square
>> imagesc(1-mask, 'AlphaData',mask*.3,'ColorMap','jet');
Error using image
There is no ColorMap property on the Image class.

Error in imagesc (line 39)
hh = image(varargin{:},'CDataMapping','scaled');

No luck! What’s more, setting the colormap between calls to image() and imagesc() also doesn’t work. Here’s the problem: the colormap is a property of the figure, not the data. (More precisely, it is not a property of the MATLAB axes.) When you change the colormap, you change the colors of every datapoint in the image.

The fix

MATLAB’s colormap mechanism is just simple enough to be confusing. MATLAB stores colours as 1×3 vectors, where each element in the vector is the proportion of red, green, or blue, respectively. [1 1 1] is white, [0 0 0] is black, and [1 0 0] is a frightfully iridescent red. A colormap is just a list of colors – 64 will normally do – which change smoothly from one colour to another. To have a look at the built-in MATLAB colormaps, see here.

image converts every value in the matrix to a whole number (call that number i; fractional values are rounded down) and plots that pixel with the color given by colormap(i,:). Zero or below gets the first entry in the colormap and any index higher than the maximum is displayed with the last color in the colormap. So, if we construct a new colormap by concatenating two colormaps (the first occupying rows 1 to 64 and the second rows 65 to 128) and scale our data so that the minimum is 65 and the maximum is 128, the data will never use the first set of colors. Likewise, if we scale so that the lowest value is 1 and the highest is 64, we will use only the first colormap. This seems like the sort of thing that we could, and in fact should, manage automatically. So I set myself to replace image and imagesc so that they would accept a ColorMap parameter.
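The index arithmetic behind the concatenation trick can be sketched outside MATLAB. The following is a rough illustration in plain Python, not the code from my File Exchange submission: the helper scale_to_band, the two toy colormaps and the sample data are all assumptions made up for this example.

```python
def scale_to_band(values, first, last):
    """Linearly rescale values so that min -> first and max -> last (1-based rows)."""
    lo, hi = min(values), max(values)
    if hi == lo:                         # constant data: use the first row of the band
        return [first for _ in values]
    span = last - first
    return [first + round((v - lo) / (hi - lo) * span) for v in values]


base_map = [(g / 63, g / 63, g / 63) for g in range(64)]             # black to white
highlight_map = [(1.0, 1 - k / 63, 1 - k / 63) for k in range(64)]   # white to red
combined = base_map + highlight_map      # rows 1-64, then rows 65-128

data = [0.0, 2.5, 10.0]
lower = scale_to_band(data, 1, 64)       # touches only the first colormap
upper = scale_to_band(data, 65, 128)     # touches only the second colormap

print(lower, upper)  # [1, 17, 64] [65, 81, 128]
```

The same data land in disjoint halves of the combined colormap depending only on the band they are scaled into, which is what lets two images in one figure appear to have different colormaps.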

How would it work?

>> colormap bone
>> imagesc(dists>8)
>> hold on
>> imagesc(mask,'ColorMap',[0.5 0 0.5],'AlphaData',0.5*(mask>0))
>> axis square

Beautiful!

Implementation notes

image is implemented in the MATLAB Java source code, but imagesc is a wrapper to image, written directly in MATLAB code. Therefore, overloading image requires the new function to be placed in a special directory called @double, while imagesc can be placed anywhere (except it cannot be placed in @double). If you then want to call the original version of image(), you can use builtin('image',arg1,arg2,…), whereas if you want to call the original imagesc, it is a right pain. Instead, I used type imagesc to extract the source of imagesc and I modified that source directly – obviating any need to call the original imagesc. For reference, though, the most efficient way works out to be to find the function with which('imagesc'), cd into the containing directory, create a function handle to imagesc, and then cd out. As I said, it’s a mess.
These edits break colorbars. I added a spacer entry in each colormap which stores the maximum and minimum ‘real’ values of the data – in case that is useful for when I get around to extending colorbar. colormap entries must be inside [0,1] so these data are stored in the first twelve decimal places of the colormap entries: a strange burlesque on floating points. It’s a hack, but for my purposes it works.
In addition to the standard colormaps, I often require a mask in a particular color. For this purpose it helps to have a colormap that smoothly varies from white to the color in question. It actually doesn’t matter if it varies from white or any other color – ultimately, I only use the full colour value, since I set the transparency of all other pixels to maximum – but either way, passing the colour on [0,1] scale or [0,255] scale sets a colormap which varies from white to that color.
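To make the white-to-colour idea concrete, here is a small sketch in plain Python. It is an illustration under assumptions rather than the File Exchange code: the name white_to_color is hypothetical, and the rule that any component above 1 signals the [0,255] scale is a guess at a reasonable convention.

```python
def white_to_color(color, n=64):
    """Return n RGB rows fading linearly from white to `color`.

    `color` may be on a [0, 1] or a [0, 255] scale; any component above 1
    is taken to mean the [0, 255] scale (an assumption of this sketch).
    """
    if n < 2:
        raise ValueError("need at least two rows to interpolate")
    if max(color) > 1:
        color = [c / 255 for c in color]
    return [
        tuple(1 + (c - 1) * k / (n - 1) for c in color)
        for k in range(n)
    ]


cmap = white_to_color([128, 0, 128], n=5)   # purple given on the [0, 255] scale
print(cmap[0])    # first row is white
print(cmap[-1])   # last row is the requested colour on the [0, 1] scale
```

One caveat of this convention is that a colour such as [1 0 0] is always read as being on the [0,1] scale, so a very dark [0,255] colour with no component above 1 would be misinterpreted.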

The code is available on MATLAB File Exchange at this link and is installable by copying imagesc.m, bootleg_fp.m, and the directory @double into your working directory. The idea to concatenate colormaps is widely available online – for example, here.

A Day in the Life of a DPhil Student… that also rows for Oxford.

I couldn’t decide whether to write this blog post. However, I sifted through the archives of BLOPIG and found in the original post this excerpt:

“And if your an athlete, like Anna (Dr. Lewis) who crossed the atlantic in a rowing boat or Eleanor who used to row for the blues – what can I say, this is how we roll, or row [feeble attempt at humour] – thats a non-scientific but unique and interesting experience too (Idea #8). .”

Therefore I’ve decided that it might be an interesting post to look into what life is like when you are studying for a DPhil and also training for the blues. Rowing in particular is a controversial sport – I have heard many stories claiming that rowing will be to the absolute detriment of your DPhil. I’ve never felt pressured as part of OPIG to give up rowing – all of my supervisors have been very fair, in that if I get the work done then they accept this is part of my life. However, I realise not all supervisors are so understanding. I hope this blog post will give some insight into what it is like to trial for a Blues sport (in this case Women’s Lightweight Rowing), whilst studying for a DPhil at Oxford.

4:56 am – Alarm goes off. If it’s after September it’s dark, cold and likely raining. No breakfast as I will do the first training session fasted – just get dressed and go!

5:15 am – Leave the house with a bag full of kit, food for the day, laptop and papers to cycle to Iffley Sports Centre.

5:45 am – Lightweight Women’s minibus leaves from Iffley to drive to Wallingford. Some girls try to study in the bus, but to be honest it’s too dark and we’re all a bit too sleepy.

6:15 am – Arrive at Wallingford. Get onto the water for a session in the boats. Although in the Boat Race we race in an 8 (8 rowers with one oar each, with a cox steering), we spend lots of time in different boats throughout the season. Perhaps unlike our openweight counterparts, we also do a lot of sculling (two oars per rower) as the only Olympic class boat for lightweight women is a sculling boat. We travel to Wallingford for a much longer, emptier stretch of river and normally get to see the sunrise.

8:10 am – We leave Wallingford to head back to Oxford. Start waiting in A LOT of traffic once you hit the ring road, and there’s a lot of panic in the bus about whether 9 am lectures will be made on time!

8:50 am – Arrive back at Iffley Sports Centre. Grab bike and cycle to the department.

9:00-9:15 am – Arrive at the Department. Quick shower to thaw frozen fingers and to not repulse my fellow OPIG members. I then get to eat warm porridge (highlight of the day) and go through my emails. I also check whether any of my jobs have finished on the group servers – one of the great perks of being in OPIG is the computational resources available to the group. Check the to-do list from yesterday and write a to-do list for today and get to work (coding, plotting results, reading papers or writing)!

11:00 am (Tuesdays & Thursdays) – Coffee morning! Although if it’s any time close to a race no bourbon biscuits or cake for me. This is a bit of an issue because at OPIG we eat a lot of cake. However, one member can usually be relied upon to eat my portion.

1:00 pm – Lunchtime! As a lightweight rower I am required to weigh in at 59kg on the day of the Boat Race. If I am over that weight I don’t get to race. Therefore, I spend a portion of the year dieting to make sure I hit that target. The dieting lunch consists of soup and Greek yogurt. The post-race non-dieting lunch consists of pasta from Taylors, chocolate and a Coke (yum!). OPIG members generally all have lunch at this time and enjoy solving the Times Cryptic Crossword. I’m not the best at crosswords so I normally chat to Laura and don’t concentrate.

2:00 pm – Back to work. Usually coding whilst listening to music. I normally start rushing to be able to submit some jobs to the group servers before I have to leave the office.

3:00 pm – Go to get a chocomilk with Clare. A chocomilk from the vending machine in our department costs 20p and is only 64 calories!

5:30 pm – Cycle to Iffley Sports Centre for the second training session of the day.

5:45 pm – If it’s light enough we hop in the minibus to go to Wallingford for another outing on the water. However, for most of the season it’s too dark and we head to the gym. This will either consist of weights to build strength, or we will use the indoor rowing machine (erg) to build fitness. The erg is my nemesis, so this is not a session I look forward to. I find staring at a screen that constantly tells you how hard you are pushing, or that you are no longer pushing as hard, to be psychologically quite tough. I’d much rather be gliding along the river.

8:35 pm – Leave Iffley after a long session to head home. Quickly down a Yazoo (strawberry milk) to boost recovery as I won’t be eating dinner until 45 minutes to an hour after the end of the session.

9:00 pm – Arrive home. I “cook” dinner which when I’m dieting consists of chucking sweet potato and healthy sausages from M&S in the oven while I pack my kit bag for the next day.

9:30 pm – Wolf down dinner and drink about a pint of milk, whilst finally catching up with my boyfriend about both our days.

10:00 pm – Bedtime at the latest.

Repeat!

When Does Chemical Elaboration Induce a Ligand To Change Its Binding Mode?

For my journal club in June, I chose to present a Journal of Medicinal Chemistry article entitled “When Does Chemical Elaboration Induce a Ligand To Change Its Binding Mode?” by Malhotra and Karanicolas. This article uses a large scale collection of ligand pairs to investigate the circumstances in which elaborations of a ligand change the original binding mode.

One of the primary goals in medicinal chemistry is the optimisation of biological activity by chemical elaboration of a hit compound. This hit-to-lead optimisation often assumes that addition of functional groups to a given hit scaffold will not change the original binding mode.

In order to investigate the circumstances in which this assumption holds true and how often it holds true, they built up a large-scale collection of 297 related ligand pairs solved in complex with the same protein partner. Each pair consisted of a larger and smaller ligand; the larger ligand could have arisen from elaboration of the smaller ligand. They found that for 41 out of the 297 pairs (14%), the binding mode changed upon elaboration of the smaller ligand.

They investigated many physicochemical properties of the ligand, the protein-ligand complex and the protein binding pocket. They summarise the statistical significance and predictive power of the investigated properties with the table shown below.

They found that the property with the lowest p-value was the “rmsd after minimisation of the aligned complex” (RMAC). They developed this metric to probe whether the larger ligand could be accommodated in the protein without changing binding mode. They did so by aligning the shared substructure of the larger ligand onto the smaller ligand’s complex and then carrying out an energy minimisation. By monitoring the RMSD difference of the larger ligand relative to the initial pose (RMAC), they can gauge how compatible the larger ligand is with the protein. Larger RMAC values indicate greater incompatibility, hence a greater likelihood for the binding mode to not be preserved.
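The RMSD at the heart of RMAC is simple to compute once the initial and minimised poses are available in the same coordinate frame. The following is a minimal sketch in plain Python and not the authors' pipeline: it performs no alignment or energy minimisation, the function name rmsd is illustrative, and the two-atom coordinates are invented toy values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate sets of equal length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Toy larger-ligand poses: before and after a hypothetical minimisation.
initial_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
minimised_pose = [(0.0, 3.0, 0.0), (1.0, 0.0, 4.0)]

print(rmsd(initial_pose, minimised_pose))   # sqrt((9 + 16) / 2), about 3.54
```

A value near zero would mean the larger ligand stayed where the smaller ligand's pose put it, whereas a large value suggests the pocket could not accommodate the elaboration without the ligand moving.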

The authors generated receiver operating characteristic (ROC) plots to compare the predictive power of the properties considered. ROC curves are made by plotting the true positive rate (TPR) against the false positive rate (FPR). A random classifier would yield the dotted line from the bottom left to the top right, shown in the plots below. The best predictors would give a point in the top left corner of the plot. The properties that do well include RMAC, pocket volume, molecular weight, lipophilicity and potency.
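For readers who have not built a ROC curve by hand, here is a minimal sketch in plain Python. The scores and labels are made-up numbers rather than data from the paper, and the function names roc_points and auc are illustrative. It assumes that a higher score (for example a larger RMAC) should predict a change in binding mode.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, sweeping a threshold from the highest score down.

    labels: 1 = binding mode changed (positive), 0 = preserved (negative).
    Tied scores produce a single point per distinct score.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


rmac = [2.9, 2.1, 1.4, 0.8, 0.5, 0.3]   # made-up scores
changed = [1, 1, 0, 1, 0, 0]            # made-up labels

curve = roc_points(rmac, changed)
print(curve)
print(auc(curve))   # 8/9, about 0.89; a random classifier would give about 0.5
```

The area under the curve summarises the whole plot in one number, which is a convenient way to rank properties against each other.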

They also combined properties to enhance predictive power and conclude that RMAC and molecular weight together offer good predictivity.

Finally, the authors look at the pairs that have low RMAC values (i.e. the elaboration should be compatible with the protein pocket), yet show a change in binding mode. For these cases, a specific substitution may enable formation of a new, stronger interaction or, for pseudosymmetric ligands, the alternate pose can mimic many of the interactions of the original pose.

Antibody Developability: Experimental Screening Assays

[This blog post is centered around the paper “Biophysical properties of the clinical-stage antibody landscape” (http://www.pnas.org/content/114/5/944.abstract) by Tushar Jain and coworkers. It is designed as a very basic intro for computational scientists into the world of experimental biophysical assays.]

A major concern in the development of antibody therapies is being able to predict “developability issues” at the screening stage, to avoid costly developmental dead-ends. Examples of such issues include an antibody being difficult to manufacture, possessing unsuitable pharmacodynamic or pharmacokinetic profiles, having a propensity to aggregate (both in storage and in vivo) and being highly immunogenic.

This post is designed to give a clear and concise summary of the principles behind some of the most common biophysical experimental assays used to assess antibody candidates for future developability issues.

1. Ease of manufacture

HEK Titre (HEKt): This assay tests the expression level of the antibody (the higher the better). The heavy and light chain sequences are subcloned into vectors (such as pcDNA 3.4+, ThermoFisher) and these vectors are subsequently transfected into a suspension of human embryonic kidney (HEK293) cells. After a set number of days the supernatant is harvested to assess the degree of expression.

2. Stability of 3D structure

Melting temperature using Differential Scanning Fluorimetry (Tm with DSF) Assay: This assay tests the thermal stability of the antibody. The higher the thermal stability, the less likely the protein will spontaneously unfold and become immunogenic. The antibody is mixed with a dye that fluoresces when in contact with hydrophobic regions, such as SYPRO Orange. The mixture is then taken through a range of temperatures (e.g. 40°C -> 95°C at a rate of 0.5°C/2min). As the protein begins to unfold, buried hydrophobic residues will become exposed and the level of fluorescence will suddenly increase. The value of T when the increase in fluorescence intensity is greatest gives us a Tm value.

(Further reading: http://www.beta-sheet.org/resources/T22-Niesen-fingerprinting_Oxford.pdf)
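For computational readers, reading a Tm off a melt curve amounts to finding where the fluorescence rises fastest. Below is a minimal sketch in plain Python, assuming a clean curve sampled at known temperatures. The function name melting_temperature and the data points are invented for illustration, and real instruments typically smooth the curve or fit a sigmoid first.

```python
def melting_temperature(temps, fluorescence):
    """Estimate Tm as the midpoint of the interval with the steepest fluorescence rise."""
    if len(temps) != len(fluorescence) or len(temps) < 2:
        raise ValueError("need at least two matched temperature/fluorescence readings")
    best_slope, tm = None, None
    for i in range(len(temps) - 1):
        # Finite-difference estimate of dF/dT on this interval.
        slope = (fluorescence[i + 1] - fluorescence[i]) / (temps[i + 1] - temps[i])
        if best_slope is None or slope > best_slope:
            best_slope, tm = slope, (temps[i] + temps[i + 1]) / 2
    return tm


temps = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0]      # degrees C, toy values
signal = [10.0, 11.0, 12.0, 15.0, 40.0, 90.0, 100.0, 98.0]    # arbitrary units

print(melting_temperature(temps, signal))   # 62.5
```

The resolution of this estimate is limited by the temperature step, which is one reason the ramp in a real experiment is much finer than in this toy series.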

3. Stickiness assays (Aggregation propensity/Low solubility/High viscosity)

Affinity-capture Self-interaction Nanoparticle Spectroscopy (AC-SINS) Assay: This assay tests how likely an antibody is to interact with itself. It uses gold nanoparticles that are coated with anti-Fc antibodies. When a dilute solution of antibodies is added, they rapidly become immobilised on the gold beads. If these antibodies subsequently attract one another, it leads to shorter interparticle distances and longer absorption wavelengths that can be detected by spectroscopy.

(Further reading: https://www.ncbi.nlm.nih.gov/pubmed/24492294)

Clone Self-interaction by Bio-layer Interferometry (CSI-BLI) Assay: A more high-throughput method that uses a label-free technology to measure self-interaction. Antibodies are loaded onto the biosensor tip and white light is shone down the instrument to yield an internal reflection interference pattern. Then the tip is inserted into a solution of the same antibody, and if self-interaction occurs, then the interference pattern shifts by an amount proportional to the change in thickness of the biological layer. Images from: http://www.fortebio.com/bli-technology.html

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3896597/)

Hydrophobic Interaction Chromatography (HIC) Assay: Antibodies are mixed into a polar mobile phase and then washed over a hydrophobic column. UV-absorbance or other techniques can then be used to determine the degree of adhesion.

(Further Reading: https://www.ncbi.nlm.nih.gov/pubmed/4094424)

Standup Monolayer Adsorption Chromatography (SMAC) Assay: Antibodies are injected onto a pre-packed Zenix HPLC column and their retention times are calculated. The longer the retention time, the lower their colloidal stability and the more prone they are to aggregate.

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4622974/)

Size-exclusion Chromatography (SEC) Assay: Antibodies are flowed through a column consisting of spherical beads with minuscule pores. Non-aggregated antibodies are small enough to get trapped in the pores, whereas aggregated antibodies will flow through the column more rapidly. Percentage aggregation can be worked out from the concentrations of the different fractions.
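The percentage calculation itself is just a ratio. As a minimal sketch in plain Python, assuming the early-eluting (aggregate) and late-eluting (monomer) fractions have already been quantified in the same units, and with a hypothetical function name and toy numbers:

```python
def percent_aggregate(aggregate_amount, monomer_amount):
    """Percentage of the total antibody that eluted in the aggregate fraction."""
    total = aggregate_amount + monomer_amount
    if total <= 0:
        raise ValueError("total amount must be positive")
    return 100.0 * aggregate_amount / total


# Toy values: 1 unit in the aggregate fraction, 19 units in the monomer fraction.
print(percent_aggregate(1.0, 19.0))   # 5.0
```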

4. Degree of specificity

Cross-Interaction Chromatography (CIC) Assay: This assay measures an antibody’s retention time as it flows across a column conjugated with polyclonal human serum antibodies. If an antibody takes longer to exit the column, it indicates that its surface is likely to interact with several different in vivo targets.

(Further Reading: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3896597/)

Enzyme-linked Immunosorbent Assay (ELISA) – with common antigens or Baculovirus Particles (BVPs): Common antigens or BVPs are fixed onto a solid surface and then a solution containing the antibody of interest linked to an enzyme (such as horseradish peroxidase, HRP) is washed over them. Incubation lasts for about an hour before any unreacted antibodies are washed off. When the appropriate enzyme substrate is then added, it triggers emission of a visible, fluorescent or luminescent nature, which can be detected. The intensity is proportional to the amount of antibody stuck to the surface.

(Further Reading: https://www.thermofisher.com/uk/en/home/life-science/protein-biology/protein-biology-learning-center/protein-biology-resource-library/pierce-protein-methods/overview-elisa.html)

Poly-Specificity Reagent (PSR) Binding Assay: A more high-throughput method that uses fluorescence-activated cell sorting (FACS), a type of flow cytometry. A PSR is generated by biotinylating soluble membrane proteins (from Chinese hamster ovary (CHO) cells, for example) and is then incubated with IgG-presenting yeast. After washing, a secondary labeling mix is added, and flow cytometry is used to determine a median fluorescence intensity – the higher the median intensity, the greater the chance of non-specific binding.

(Further Reading: https://www.ncbi.nlm.nih.gov/pubmed/24046438)

Oxford Protein Informatics Group

or "OPIG" to friends
