Category Archives: How To

How to do things. doh.

Crystallographic programming: Super short tour of the cctbx

Two of the leading packages in crystallography are Phenix and CCP4. For most practicing crystallographers they will interact via with these to progress a single crystallographic data-set from diffraction images, through integration, merging, phasing, model building and hopefully deposition.

However, if you want to develop crystallographic software, you will likely need to decide on a framework to build upon. Phenix is built on the comprehensive cctbx library, whereas CCP4 programs are typically standlone, although common crystallographic libraries such as clipper and cctbx are utilised.

CCTBX is written mainly in python, with core crystallographic functionality written in C++. My usual starting place for understanding functionality is through the pdb parser tutorial. This introduces the concept of a hierarchy, a iterative way to represent a macromolecule:

from iotbx.pdb import hierarchy
pdb_in = hierarchy.input(file_name="model.pdb")
for chain in pdb_in.hierarchy.only_model().chains() :
  for residue_group in chain.residue_groups() :
    for atom_group in residue_group.atom_groups() :
      for atom in atom_group.atoms() :
        if (atom.element.strip().upper() == "ZN") :
          atom_group.remove_atom(atom)
      if (atom_group.atoms_size() == 0) :
        residue_group.remove_atom_group(atom_group)
    if (residue_group.atom_groups_size() == 0) :
      chain.remove_residue_group(residue_group)
f = open("model_Zn_free.pdb", "w")
f.write(pdb_in.hierarchy.as_pdb_string(
  crystal_symmetry=pdb_in.input.crystal_symmetry()))
f.close()

Although there are many ways to parse a pdb file, the introduction to iotbx.pdb, gives a view of how xray structure data can be associated to the model. The tour of the cctbx can be helpful starting place, especially for understanding how the python and c++ functionality interact through boost and the scitbx.array_family.flex. Unfortunately, documentation on cctbx tends to vary in quality and quantity throughout the modules:

Other components of the library include ways to simulate crystallographic data through simtbx,  and tools for processing xfel data.

As the library is open source, github hosted source code allows exploration of previously written routines, which can be very helpful for understanding the inner workings of the library. Note that there are also bulletin boards for users and developers of phenix and cctbx respectively. A few tutorials can also be found.

Hopefully this post will give someone other than me a reminder of where to find resources to get started developing within CCTBX.

Drawing Networks in LaTeX with tikz-network

While researching on protein interaction networks it is often important to illustrate networks. For this many different tools are available, for example, Python’s NetworkX and Matlab, that allow the export of figures as pixelated images or vector graphics. Usually, these figures are then incorporated in the papers, which are commonly written in LaTeX. In this post, I want to present `tikz-network’, which is a novel tool to code and illustrate networks directly in LaTeX.

To create an illustration you define the network’s nodes with their positions and edges between these nodes. An example of a simple network is

\begin{tikzpicture}
   \Vertex[color = blue]{A}
   \Vertex[x=3,y=1,color=red]{B}
   \Vertex[x=0,y=2,color=orange]{C}
   \Edge[lw=5pt](A)(B)
   \Edge[lw=3pt,bend=15,Direct](A)(C)
\end{tikzpicture}

The illustrations can be much more complex and allow dashed lines, opacity, and many other features. Importantly, the properties do not need to be specified in the LaTeX file itself but can also be saved in an external file and imported with the  \Vertices{data/vertices.csv}command. This allows the representation of more complex networks, for example the multilayer network below is created from the two files, the first representing the nodes

id, x, y ,size, color,opacity,label,layer 
A, 0, 0, .4 , green, .9 , a , 1
B, 1, .7, .6 , , .5 , b , 1
C, 2, 1, .8 ,orange, .3 , c , 1
D, 2, 0, .5 , red, .7 , d , 2
E,.2,1.5, .5 , gray, , e , 1
F,.1, .5, .7 , blue, .3 , f , 2
G, 2, 1, .4 , cyan, .7 , g , 2
H, 1, 1, .4 ,yellow, .7 , h , 2

and the second having the edge information:

u,v,label,lw,color ,opacity,bend,Direct
A,B, ab  ,.5,red   ,   1   ,  30,false
B,C, bc  ,.7,blue  ,   1   , -60,false
A,E, ae  , 1,green ,   1   ,  45,true
C,E, ce  , 2,orange,   1   ,   0,false
A,A, aa  ,.3,black ,  .5   ,  75,false
C,G, cg  , 1,blue  ,  .5   ,   0,false
E,H, eh  , 1,gray  ,  .5   ,   0,false
F,A, fa  ,.7,red   ,  .7   ,   0,true
D,F, df  ,.7,cyan  ,   1   ,   30,true
F,H, fh  ,.7,purple,   1   ,   60,false
D,G, dg  ,.7,blue  ,  .7   ,   60,false

For details, please see the extensive manual on the GitHub page of the project. It is a very new project and I only started using it but I like it so far for a couple of reasons:

  • it is easy to use, especially for small example graphs
  • the multilayer functionality is very convenient
  • included texts are automatically in the correct size and font with the rest of the LaTeX document
  • it can be combined with regular tikz commands to create more complex illustrations

Using Random Forests in Python with Scikit-Learn

I spend a lot of time experimenting with machine learning tools in my research; in particular I seem to spend a lot of time chasing data into random forests and watching the other side to see what comes out. In my many hours of Googling “random forest foobar” a disproportionate number of hits offer solutions implemented in R. As a young Pythonista in the present year I find this a thoroughly unacceptable state of affairs, so I decided to write a crash course in how to build random forest models in Python using the machine learning library scikit-learn (or sklearn to friends). This is far from exhaustive, and I won’t be delving into the machinery of how and why we might want to use a random forest. Rather, the hope is that this will be useful to anyone looking for a hands-on introduction to random forests (or machine learning in general) in Python.

In the future I’ll write a more in-depth post on how a few libraries turn Python into a powerful environment for data handling and machine learning. Until then, though, let’s jump into random forests!

Toy datasets

Sklearn comes with several nicely formatted real-world toy data sets which we can use to experiment with the tools at our disposal. We’ll be using the venerable iris dataset for classification and the Boston housing set for regression. Sklearn comes with a nice selection of data sets and tools for generating synthetic data, all of which are well-documented. Now, let’s write some Python!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
iris = datasets.load_iris()

Classification using random forests

First we’ll look at how to do solve a simple classification problem using a random forest. The iris dataset is probably the most widely-used example for this problem and nicely illustrates the problem of classification when some classes are not linearly separable from the others.

First we’ll load the iris dataset into a pandas dataframe. Pandas is a nifty Python library which provides a data structure comparable to the dataframes found in R with database style querying. As an added bonus, the seaborn visualization library integrates nicely with pandas allowing us to generate a nice scatter matrix of our data with minimal fuss.

df = pd.DataFrame(iris.data, columns=iris.feature_names)

# sklearn provides the iris species as integer values since this is required for classification
# here we're just adding a column with the species names to the dataframe for visualisation
df['species'] = np.array([iris.target_names[i] for i in iris.target])

sns.pairplot(df, hue='species')

Neat. Notice that iris-setosa is easily identifiable by petal length and petal width, while the other two species are much more difficult to distinguish. We could do all sorts of pre-processing and exploratory analysis at this stage, but since this is such a simple dataset let’s just fire on. We’ll do a bit of pre-processing later when we come to the Boston data set.

First, let’s split the data into training and test sets. We’ll used stratified sampling by iris class to ensure both the training and test sets contain a balanced number of representatives of each of the three classes. Sklearn requires that all features and targets be numeric, so the three classes are represented as integers (0, 1, 2). Here we’re doing a simple 50/50 split because the data are so nicely behaved. Typically however we might use a 75/25 or even 80/20 training/test split to ensure we have enough training data. In true Python style this is a one-liner.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)

Now let’s fit a random forest classifier to our training set. For the most part we’ll use the default settings since they’re quite robust. One exception is the out-of-bag estimate: by default an out-of-bag error estimate is not computed, so we need to tell the classifier object that we want this.

If you’re used to the R implementation, or you ever find yourself having to compare results using the two, be aware that some parameter names and default settings are different between the two. Fortunately both have excellent documentation so it’s easy to ensure you’re using the right parameters if you ever need to compare models.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)

Let’s see how well our model performs when classifying our unseen test data. For a random forest classifier, the out-of-bag score computed by sklearn is an estimate of the classification accuracy we might expect to observe on new data. We’ll compare this to the actual score obtained on our test data.

from sklearn.metrics import accuracy_score

predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)

print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Out-of-bag score estimate: 0.973
Mean accuracy score: 0.933

Not bad. However, this doesn’t really tell us anything about where we’re doing well. A useful technique for visualising performance is the confusion matrix. This is simply a matrix whose diagonal values are true positive counts, while off-diagonal values are false positive and false negative counts for each class against the other.

from sklearn.metrics import confusion_matrix

cm = pd.DataFrame(confusion_matrix(y_test, predicted), columns=iris.target_names, index=iris.target_names)
sns.heatmap(cm, annot=True)

This lets us know that our model correctly separates the setosa examples, but exhibits a small amount of confusion when attempting to distinguish between versicolor and virginica.

Random forest regression

Now let’s look at using a random forest to solve a regression problem. The Boston housing data set consists of census housing price data in the region of Boston, Massachusetts, together with a series of values quantifying various properties of the local area such as crime rate, air pollution, and student-teacher ratio in schools. The question for us is whether we can use these data to accurately predict median house prices. One caveat of this data set is that the median house price is truncated at $50,000 which suggests that there may be considerable noise in this region of the data. You might want to remove all data with a median house price of $50,000 from the set and see if the regression improves at all.

As before we’ll load the data into a pandas dataframe. This time, however, we’re going to do some pre-processing of our data by independently transforming each feature to have zero mean and unit variance. The values of different features vary greatly in order of magnitude. If we were to analyse the raw data as-is, we run the risk of our analysis being skewed by certain features dominating the variance. This isn’t strictly necessary for a random forest, but will enable us to perform a more meaningful principal component analysis later. Performing this transformation in sklearn is super simple using the StandardScaler class of the preprocessing module. This time we’re going to use an 80/20 split of our data. You could bin the house prices to perform stratified sampling, but we won’t worry about that for now.

boston = datasets.load_boston()

features = pd.DataFrame(boston.data, columns=boston.feature_names)
targets = boston.target

As before, we’ve loaded our data into a pandas dataframe. Notice how I have to construct new dataframes from the transformed data. This is because sklearn is built around numpy arrays. While it’s possible to return a view of a dataframe as an array, transforming the contents of a dataframe requires a little more work. Of course, there’s a library for that, but I’m lazy so I didn’t use it this time.

from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), index=X_train.index.values, columns=X_train.columns.values)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), index=X_test.index.values, columns=X_test.columns.values)

With the data standardised, let’s do a quick principal-component analysis to see if we could reduce the dimensionality of the problem. This is quick and easy in sklearn using the PCA class of the decomposition module.

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
cpts = pd.DataFrame(pca.transform(X_train))
x_axis = np.arange(1, pca.n_components_+1)
pca_scaled = PCA()
pca_scaled.fit(X_train_scaled)
cpts_scaled = pd.DataFrame(pca.transform(X_train_scaled))

# matplotlib boilerplate goes here

Notice how without data standardisation the variance is completely dominated by the first principal component. With standardisation, however, we see that in fact we must consider multiple features in order to explain a significant proportion of the variance. You might want to experiment with building regression models using the principal components (or indeed just combinations of the raw features) to see how well you can do with less information. For now though we’re going to use all of the (scaled) features as the regressors for our model. As with the classification problem fitting the random forest is simple using the RandomForestRegressor class.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

Now let’s see how we do on our test set. As before we’ll compare the out-of-bag estimate (this time it’s an R-squared score) to the R-squared score for our predictions. We’ll also compute Spearman rank and Pearson correlation coefficients for our predictions to get a feel for how we’re doing.

from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr

predicted_train = rf.predict(X_train)
predicted_test = rf.predict(X_test)

test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)

print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print(f'Test data R-2 score: {test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')
Out-of-bag R-2 score estimate: 0.841
Test data R-2 score: 0.886
Test data Spearman correlation: 0.904
Test data Pearson correlation: 0.942

Not too bad, though there are a few outliers that would be worth looking into. Your challenge, should you choose to accept it, is to see if removing the $50,000 data improves the regression.

Wrapping up

Congratulations on making it this far. Now you know how to pre-process your data and build random forest models all from the comfort of your iPython session. I plan on writing more in the future about how to use Python for machine learning, and in particular how to make use of some of the powerful tools available in sklearn (a pipeline for data preparation, model fitting, prediction, in one line of Python? Yes please!), and how to make sklearn and pandas play nicely with minimal hassle. If you’re lucky, and if I can bring myself to process the data nicely, I might include some fun examples from less well-behaved real-world data sets.

Until then, though, happy Pythoning!

A very basic introduction to Random Forests using R

Random Forests is a powerful tool used extensively across a multitude of fields. As a matter of fact, it is hard to come upon a data scientist that never had to resort to this technique at some point. Motivated by the fact that I have been using Random Forests quite a lot recently, I decided to give a quick intro to Random Forests using R.

So what are Random Forests?  Well, I am probably not the most suited person to answer this question (a google search will reveal much more interesting answers) , still I shall give it a go. Random Forests is a learning method for classification (and others applications — see below). It is based on generating a large number of decision trees, each constructed using a different subset of your training set. These subsets are usually selected by sampling at random and with replacement from the original data set. The decision trees are then used to identify a classification consensus by selecting the most common output (mode). While random forests can be used for other applications (i.e. regression), for the sake of keeping this post short, I shall focus solely on classification.

Why R? Well, the quick and easy question for this is that I do all my plotting in R (mostly because I think ggplot2 looks very pretty). I decided to explore Random Forests in R and to assess what are its advantages and shortcomings. I am planning to compare Random Forests in R against the python implementation in scikit-learn. Do expect a post about this in the near future!

The data: to keep things simple, I decided to use the Edgar Anderson’s Iris Data set. You can have a look at it by inspecting the contents of iris in R. This data set contains observations for four features (sepal length and width, and petal length and width – all in cm) of 150 flowers, equally split between three different iris species. This data set is fairly canon in classification and data analysis. Let us take a look at it, shall we:

As you can observe, there seems to be some separation in regards to the different features and our three species of irises [note: this set is not very representative of a real world data set and results should be taken with a grain of salt].

Training and Validation sets: great care needs to be taken to ensure clear separation between training and validation sets. I tend to save the cases for which I am actually interested in performing predictions as a second validation set (Validation 2). Then I split the remaining data evenly into Training and Validation 1.

Let us split our data set then, shall we?

# Set random seed to make results reproducible:
set.seed(17)
# Calculate the size of each of the data sets:
data_set_size <- floor(nrow(iris)/2)
# Generate a random sample of "data_set_size" indexes
indexes <- sample(1:nrow(iris), size = data_set_size)

# Assign the data to the correct sets
training <- iris[indexes,]
validation1 <- iris[-indexes,]

Before we can move on, here are some things to consider:

1- The size of your data set usually imposes a hard limit on how many features you can consider. This occurs due to the curse of dimensionality, i.e. your data becomes sparser and sparser as you increase the number of features considered, which usually leads to overfitting. While there is no rule of thumb relating to how many features vs.  the number of observations you should use, I try to keep e^Nf < No (Nf = number of features, No = number of observations) to minimise overfitting [this is not always possible and it does not ensure that we won’t overfit]. In this case, our training set has 75 observations, which suggests that using four features (e^4 ~ 54.6) is not entirely absurd. Obviously, this depends on your data, so we will cover some further overfitting checks later on.

2- An important thing to consider when assembling training sets is the proportion of negatives vs. positives in your data. Think of an extreme scenario where you have many, many more observations for one class vs. the others. How will this affect classification? This would make it more likely for the classifier to predict the dominant class when given new values. I mentioned before that the iris set is quite nice to play with. It comes with exactly 50 observations for each species of irises. What happens if you have a data set with a much higher number of observations for a particular class? You can bypass any imbalance regarding the representation of each class by carefully constructing your training set in order not to favour any particular class. In this case, our randomly selected set has 21 observations for species setosa and 27 observations for each of species versicolor and virginica, so we are good to go.

3- Another common occurrence that is not represented by the iris data set is missing values (NAs) for observations. There are many ways of dealing with missing values, including assigning the median or the mode for that particular feature to the missing observation or even disregarding some observations entirely, depending on how many observations you have. There are even ways to use random forests to estimate a good value to assign to the missing observations, but for the sake of brevity, this will not be covered here.

Right, data sets prepared and no missing values, it is time to fire our random forests algorithm. I am using the  randomForest package. You can click the link for additional documentation. Here is the example usage code:

#import the package
library(randomForest)
# Perform training:
rf_classifier = randomForest(Species ~ ., data=training, ntree=100, mtry=2, importance=TRUE)

Note some important parameters:

-The first parameter specifies our formula: Species ~ . (we want to predict Species using each of the remaining columns of data).
ntree defines the number of trees to be generated. It is typical to test a range of values for this parameter (i.e. 100,200,300,400,500) and choose the one that minimises the OOB estimate of error rate.
mtry is the number of features used in the construction of each tree. These features are selected at random, which is where the “random” in “random forests” comes from. The default value for this parameter, when performing classification, is sqrt(number of features).
importance enables the algorithm to calculate variable importance.

We can quickly look at the results of our classifier for our training set by printing the contents of rf_classifier:

> rf_classifier

Call:
 randomForest(formula = Species ~ ., data = training,ntree=100,mtry=2, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         21          0         0  0.00000000
versicolor      0         25         2  0.07407407
virginica       0          2        25  0.07407407


As you can see, it lists the call used to build the classifier, the number of trees (100), the variables at each split (2), and it outputs a very useful confusion matrix and OOB estimate of error rate. This estimate is calculated by counting however many points in the training set were misclassified (2 versicolor and 2 virginica observations = 4) and dividing this number by the total number of observations (4/75 ~= 5.33%).

The OOB estimate of error rate is a useful measure to discriminate between different random forest classifiers. We could, for instance, vary the number of trees or the number of variables to be considered, and select the combination that produces the smallest value for this error rate. For more complicated data sets, i.e. when a higher number of features is present, a good idea is to use cross-validation to perform feature selection using the OOB error rate (see rfcv from randomForest for more details).

Remember the importance parameter? Let us take a look at the importance that our classifier has assigned to each variable:

varImpPlot(rf_classifier)

Each features’s importance is assessed based on two criteria:

-MeanDecreaseAccuracy: gives a rough estimate of the loss in prediction performance when that particular variable is omitted from the training set. Caveat: if two variables are somewhat redundant, then omitting one of them may not lead to massive gains in prediction performance, but would make the second variable more important.

-MeanDecreaseGini: GINI is a measure of node impurity. Think of it like this, if you use this feature to split the data, how pure will the nodes be? Highest purity means that each node contains only elements of a single class. Assessing the decrease in GINI when that feature is omitted leads to an understanding of how important that feature is to split the data correctly.

Do note that these measures are used to rank variables in terms of importance and, thus, their absolute values could be disregarded.

Ok, great. Looks like we have a classifier that was properly trained and is producing somewhat good predictions for our training set. Shall we evaluate what happens when we try to use this classifier to predict classes for our  validation1 set?

# Validation set assessment #1: looking at confusion matrix
prediction_for_table <- predict(rf_classifier,validation1[,-5])
table(observed=validation1[,5],predicted=prediction_for_table)

            predicted
observed     setosa versicolor virginica
  setosa         29          0         0
  versicolor      0         20         3
  virginica       0          1        22

The confusion matrix is a good way of looking at how good our classifier is performing when presented with new data.

Another way of assessing the performance of our classifier is to generate a ROC curve and compute the area under the curve:

 

# Validation set assessment #2: ROC curves and AUC

# Needs to import ROCR package for ROC curve plotting:
library(ROCR)

# Calculate the probability of new observations belonging to each class
# prediction_for_roc_curve will be a matrix with dimensions data_set_size x number_of_classes
prediction_for_roc_curve <- predict(rf_classifier,validation1[,-5],type="prob")

# Use pretty colours:
pretty_colours <- c("#F8766D","#00BA38","#619CFF")
# Specify the different classes 
classes <- levels(validation1$Species)
# For each class
for (i in 1:3)
{
 # Define which observations belong to class[i]
 true_values <- ifelse(validation1[,5]==classes[i],1,0)
 # Assess the performance of classifier for class[i]
 pred <- prediction(prediction_for_roc_curve[,i],true_values)
 perf <- performance(pred, "tpr", "fpr")
 if (i==1)
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i]) 
 }
 else
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i],add=TRUE) 
 }
 # Calculate the AUC and print it to screen
 auc.perf <- performance(pred, measure = "auc")
 print(auc.perf@y.values)
}

Here is the final product (ROC curve):

And here are the values for our AUCs:

Setosa
AUC = 1

Versicolor
AUC = 0.98

Virginica
AUC = 0.98

Voila! I hope this was somewhat useful!

Parallel Computing: GNU Parallel

Recently I started using the OPIG servers to run the algorithm I have developed (CRANkS) on datasets from DUDE (Database of Useful Decoys Enhanced).

This required learning how to run jobs in parallel. Previously I had been using computer clusters with their own queuing system (Torque/PBS) which allowed me to submit each molecule to be scored by the algorithm as a separate job. The queuing system would then automatically allocate nodes to jobs and execute jobs accordingly. On a side note I learnt how to submit these jobs an array, which was preferable to submitting ~ 150,000 separate jobs:

qsub -t 1:X array_submit.sh

where the contents of array_submit.sh would be:

#!/bin/bash
./$SGE_TASK_ID.sh

which would submit jobs 1.sh to X.sh, where X is the total number of jobs.

However the OPIG servers do not have a global queuing system to use. I needed a way of being able to run the code I already had in parallel with minimal changes to the workflow or code itself. There are many ways to run jobs in parallel, but to minimise work for myself, I decided to use GNU parallel [1].

This is an easy-to-use shell tool, which I found quick and easy to install onto my home server, allowing me to access it on each of the OPIG servers.

To use it I simply run the command:

cat submit.sh | parallel -j Y

where Y is the number of cores to run the jobs on, and submit.sh contains:

./1.sh
./2.sh
...
./X.sh

This executes each job making use of Y number of cores when available to run the jobs in parallel.

Quick, easy, simple and minimal modifications needed! Thanks to Jin for introducing me to GNU Parallel!

[1] O. Tange (2011): GNU Parallel – The Command-Line Power Tool, The USENIX Magazine, February 2011:42-47.

Using RDKit to load ligand SDFs into Pandas DataFrames

If you have downloaded lots of ligand SDF files from the PDB, then a good way of viewing/comparing all their properties would be to load it into a Pandas DataFrame.

RDKit has a very handy function just for this – it’s found under the PandasTool module.

I show an example below within Jupypter-notebook, in which I load in the SDF file, view the table of molecules and perform other RDKit functions to the molecules.

First import the PandasTools module:

from rdkit.Chem import PandasTools

Read in the SDF file:

SDFFile = "./Ligands_noHydrogens_noMissing_59_Instances.sdf"
BRDLigs = PandasTools.LoadSDF(SDFFile)

You can see the whole table by calling the dataframe:

BRDLigs

The ligand properties in the SDF file are stored as columns. You can view what these properties are, and in my case I have loaded 59 ligands each having up to 26 properties:

BRDLigs.info()

It is also very easy to perform other RDKit functions on the dataframe. For instance, I noticed there is no heavy atom column, so I added my own called ‘NumHeavyAtoms’:

BRDLigs['NumHeavyAtoms']=BRDLigs.apply(lambda x: x['ROMol'].GetNumHeavyAtoms(), axis=1)

Here is the column added to the table, alongside columns containing the molecules’ SMILES and RDKit molecule:

BRDLigs[['NumHeavyAtoms','SMILES','ROMol']]

How to Calculate PLIFs Using RDKit and PLIP

Protein-Ligand interaction fingerprints (PLIFs) are becoming more widely used to compare small molecules in the context of a protein target. A fingerprint is a bit vector that is used to represent a small molecule. Fingerprints of molecules can then be compared to determine the similarity between two molecules. Rather than using the features of the ligand to build the fingerprint, a PLIF is based on the interactions between the protein and the small molecule. The conventional method of building a PLIF is that each bit of the bit vector represents a residue in the binding pocket of the protein. The bit is set to 1 if the molecule forms an interaction with the residue, whereas it is set to 0 if it does not.

Constructing a PLIF therefore consists of two parts:

  1. Calculating the interactions formed by a small molecule from the target
  2. Collating this information into a bit vector.

Step 1 can be achieved by using the Protein-Ligand Interaction Profiler (PLIP). PLIP is an easy-to-use tool, that given a pdb file will calculate the interactions between the ligand and protein. This can be done using the online web-tool or alternatively using the command-line tool. Six different interaction types are calculated: hydrophobic, hydrogen-bonds, water-mediated hydrogen bonds, salt bridges, pi-pi and pi-cation. The command-line version outputs an xml report file containing all the information required to construct a PLIF.

Step 2 involves manipulating the output of the report file into a bit vector. RDKit is an amazingly useful Cheminformatics toolkit with great documentation. By reading the PLIF into an RDKit bit vector this allows the vector to be manipulated as an RDKit fingerprint. The fingerprints can then be compared using RDKit functionality very easily, for example, using Tanimoto Similarity.

EXAMPLE:

Let’s take 3 pdb files as an example. Fragment screening data from the SGC is a great sort of data for this analysis, as it contains lots of pdb structures of small hits bound to the same target. The data can be found here. For this example I will use 3 protein-ligand complexes from the BRD1 dataset: BRD1A-m004.pdb, BRD1A-m006.pdb and BRD1A-m009.pdb.

brd1_sgc

1.PLIP First we need to run plip to generate a report file for each protein-ligand complex. This is done using:


 

plipcmd -f BRD1A-m004.pdb -o m004 -x

plipcmd -f BRD1A-m006.pdb -o m006 -x

plipcmd -f BRD1A-m009.pdb -o m009 -x

 


A report file (‘report.xml’) is created for each pdb file within the directory m004, m006 and m009.

2. Get Interactions: Using a python script the results of the report can be collated using the function “generate_plif_lists” (shown below) on each report file. The function takes in the report file name, and the residues already found to be in the binding site (residue_list). “residue_list” must be updated for each molecule to be compared as the residues used to define the binding site can vary betwen each report file. The function then returns the updated “residue_list”, as well as a list of residues found to interact with the ligand: “plif_list_all”.

 


import xml.etree.ElementTree as ET

################################################################################

def generate_plif_lists(report_file, residue_list, lig_ident):

    #uses report.xml from PLIP to return list of interacting residues and update list of residues in binding site

        plif_list_all = []

        tree = ET.parse(report_file)

        root = tree.getroot()

        #list of residue keys that form an interaction

        for binding_site in root.findall('bindingsite'):

                nest = binding_site.find('identifiers')

                lig_code = nest.find('hetid')

                if str(lig_code.text) == str(lig_ident):

                        #get the plifs stuff here

                        nest_residue = binding_site.find('bs_residues')

                        residue_list_tree = nest_residue.findall('bs_residue')

                        for residue in residue_list_tree:

                                res_id = residue.text

                                dict_res_temp = residue.attrib

                                if res_id not in residue_list:

                                        residue_list.append(res_id)

                                if dict_res_temp['contact'] == 'True':

                                        if res_id not in plif_list_all:

                                                plif_list_all.append(res_id)

        return plif_list_all, residue_list

###############################################################################

plif_list_m006, residue_list = generate_plif_lists('m006/report.xml',residue_list, 'LIG')

plif_list_m009, residue_list = generate_plif_lists('m009/report.xml', residue_list, 'LIG')

plif_list_m004, residue_list = generate_plif_lists('m004/report.xml', residue_list, 'LIG')


3. Read Into RDKit: Now we have the list of binding site residues and which residues are interacting with the ligand a PLIF can be generated. This is done using the function shown below (“generate_rdkit_plif”):


from rdkit import Chem,  DataStructs

from rdkit.DataStructs import cDataStructs

################################################################################

def generate_rdkit_plif(residue_list, plif_list_all):

    #generates RDKit plif given list of residues in binding site and list of interacting residues

    plif_rdkit = DataStructs.ExplicitBitVect(len(residue_list), False)

    for index, res in enumerate(residue_list):

        if res in plif_list_all:

            print 'here'

            plif_rdkit.SetBit(index)

        else:

            continue

    return plif_rdkit

#########################################################################

plif_m006 = generate_rdkit_plif(residue_list, plif_list_m006)

plif_m009 = generate_rdkit_plif(residue_list, plif_list_m009)

plif_m004 = generate_rdkit_plif(residue_list, plif_list_m004)


4. Play! These PLIFs can now be compared using RDKit functionality. For example the Tanimoto similarity between the ligands can be computed:


def similarity_plifs(plif_1, plif_2):

    sim = DataStructs.TanimotoSimilarity(plif_1, plif_2)

    print sim

    return sim

###################################################################

print similarity_plifs(plif_m006, plif_m009)

print similarity_plifs(plif_m006, plif_m004)

print similarity_plifs(plif_m009, plif_m004)


The output is: 0.2, 0.5, 0.0.

All files used to generate the PLIFs cound be found here. Happy PLIF-making!

Viewing 3D molecules interactively in Jupyter iPython notebooks

Greg Landrum, curator of the invaluable open source cheminformatics API, RDKit, recently blogged about viewing molecules in a 3D window within a Jupyter-hosted iPython notebook (as long as your browser supports WebGL, that is).

The trick is to use py3Dmol. It’s easy to install:

pip install py3Dmol

This is built on the object-oriented, webGL based JavaScript library for online molecular visualization 3Dmol.js (Rego & Koes, 2015); here's a nice summary of the capabilities of 3Dmol.js. It's features include:

  • support for pdb, sdf, mol2, xyz, and cube formats
  • parallelized molecular surface computation
  • sphere, stick, line, cross, cartoon, and surface styles
  • atom property based selection and styling
  • labels
  • clickable interactivity with molecular data
  • geometric shapes including spheres and arrows

I tried a simple example and it worked beautifully:

import py3Dmol
view = py3Dmol.view(query='pdb:1hvr')
view.setStyle({'cartoon':{'color':'spectrum'}})
view

py3dmol_in_jupyter_ipython

The 3Dmol.js website summarizes how to view molecules, along with how to choose representations, how to embed it, and even how to develop with it.

References

Nicholas Rego & David Koes (2015). “3Dmol.js: molecular visualization with WebGL”.
Bioinformatics, 31 (8): 1322-1324. doi:10.1093/bioinformatics/btu829

Plotting and storing a 3D network in R

A simple toy example of a three layered network:

Note 1: In order to view the 3D plots, mac users will need Xquartz  installed (https://www.xquartz.org/).

 
require(igraph)
require(rgl)
#Another package that might be needed is "rglwidget". The function writeWebGL will show an error stating if rglwidget is required.
######################################//// 
######The basics######################////
######################################////
#1) Create a "food" network (three layers) 
set.seed(432)
g1<-watts.strogatz.game(dim = 1,size = 5,nei = 2,p = .5,loops = FALSE,multiple = FALSE)
g2<-watts.strogatz.game(dim = 1,size = 10,nei = 2,p = .2,loops = FALSE,multiple = FALSE)
g3<-watts.strogatz.game(dim = 1,size = 30,nei = 1,p = .5,loops = FALSE,multiple = FALSE)
g123=g1+g2+g3 


#Create more edges btw layers 
g123=rewire(g123,each_edge(prob=.4,loops = FALSE,multiple = FALSE)) 
ne=15;add_edges(g123,edges = cbind(sample(1:vcount(g1),size = ne,replace = TRUE), sample((vcount(g1)+1):vcount(g123),size = ne,replace = TRUE)))#top layer 
ne=30;add_edges(g123,edges = cbind(sample((vcount(g1)+1):(vcount(g1)+vcount(g2)),size = ne,replace = TRUE), sample((vcount(g1)+vcount(g2)+1):vcount(g123),size = ne,replace = TRUE)))#second layer 

#A quick plot of the graph
plot(g123,vertex.size=1,vertex.label.cex=0.02)


#Create 3d coordinates of the network layout
circpos=function(n,r=1){#Coordinates on a circle
 rad=seq(0,2*pi,length.out=n+1)[-1];x=cos(rad)*r;y=sin(rad)*r
 return(cbind(x,y))
}
#
lay=rbind(cbind(circpos(vcount(g1),r=1), runif(n = vcount(g1),-1,1)),
 cbind(circpos(vcount(g2),r=2), runif(n = vcount(g2),6,7)),
 cbind(circpos(vcount(g3),r=4), runif(n = vcount(g3),13,17))
)



#2d plot using the previous layout
plot(g123,vertex.size=5,vertex.label=NA,layout=lay[,c(1,3)])
plot(g123,vertex.size=1,vertex.label=NA,layout=lay[,c(1,2)])

layers

 
#3D graph plot
#Add some colour to nodes and edges
nodecols=c(rep("red",vcount(g1)),
 rep("blue",vcount(g2)),
 rep("yellow",vcount(g3)))

edgecols=function(elist,cols,grouplist){
 whatcol=rep(length(cols)+1,nrow(elist))
 finalcol=whatcol
 for(i in 1:nrow(elist)){
 for(k in length(cols):1){ 
 if( k * (length( intersect(elist[i,], grouplist[[k]]) ) > 0)){
 whatcol[i]=min(whatcol[i], k )
 }
 }
 finalcol[i]=cols[whatcol[i]]
 }
 return(finalcol)
}

#Open 3d viewer
rgl.open()
rglplot(g123, layout=lay,vertex.size=5,vertex.label=NA,vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)

3d_layers

###Storing the plot in an html file###

dirfolder="..." #your dir
#rgl.open()#instead of rgl.open use open3d, in order to save the plot. 
open3d()
rglplot(g123, layout=lay,vertex.size=5,vertex.label=NA,vertex.color=nodecols,
 edge.color=edgecols(elist=get.edgelist(g123,names = FALSE),cols=c("orange","green","pink"),grouplist=list(1:vcount(g1), (vcount(g1)+1):(vcount(g1)+vcount(g2)), (vcount(g1)+vcount(g2)+1):vcount(g123)) )
)
#Fix the view
rgl.viewpoint(theta=90, phi=0)

#Save a static 2d image:
rgl.snapshot(paste(dirfolder,"a_png_pic_0.png",sep=""), fmt="png", top=TRUE)

#Save the plot in a .htlm file:
rglfolder=writeWebGL(dir = paste(dirfolder,"first_net3d",sep=""), width=700)

#The previous function should create a file called index.htlm inside the folder "first_net3d". By opening this file in a browser (with javascript enabled) the 3d plot will be displayed again.
#Also the following command will open the plot in the browser:
browseURL(rglfolder)

Note 2: In order to view the .htlm file javascript should be enabled in the browser. (Here is an example on how to do this for safari ).

Although not covered in the previous script, further options are available such as edge/vertex size and the ability to control independently each of the nodes and edges in the graph. Here is an example that makes more use of these options:

clour_plot

3d network representing a T cell receptor. Edges are coloured according to a relevant path found between the bottom green node and the upper red node cluster.

T cell receptor (in blue), binding to a peptide (in red).

T cell receptor (in blue), binding to a peptide (in red).

Counting Threads

When someone talks about “counting threads” the first thing that you think of is probably shopping for bed sheets. But this post is not about the happy feeling of drifting off to sleep on smooth, comfortable Egyptian cotton. This post is about that much less happy feeling when you want to quickly run a bit of code on a couple of data sets to finish the results section of a thesis chapter, and you see this:

Luis blocking the server again....

Luis blocking the server again….

Obviously someone (*cough*Luis*cough*) is having some fun on the server without nice-ing their code to allow people who are much less organized than they should be (*cough*me*cough*) to do a quick last-minute data analysis run.

The solution: confront the culprit with their excessive server usage (Note: alternatives include manual server restart with a power cord to make the world your enemy – not recommended).

So… now we just need to find out how much of the server “ospina” is using. Screenshots won’t convince him… and we can’t take enough screenshots to show the extent of the server-hogging with his 1000s of processes anyway. We need to count…

Luckily there is a handy function to find out information about processes called pgrep. This is basically a ‘ps | grep’ function which has a bunch of options to reflect the many ways it can be used. We see opsina is running R, so here goes:

pgrep -c R

The -c flag counts processes and the pattern matches the command name that was run. But yeah, it turns out this wasn’t the best idea ever. A lot of people are running R (as might be expected in the Statistics Department), and you get a number that is really too high to be likely. We need to be more specific in our query, so let’s go back to the ps command. Second attempt:

ps -Af | grep ospina | wc -l

What we’re doing now is first showing all processes that are run on the server (ps -A) also showing details of the command run and who ran it (-f flag). Then we’re finding the ones that are labelled with our server culprit (grep ospina) and counting the lines we find. There are annoyingly still a few problems with this approach.

  1. We just ran this command on the server and thus will count a command like grep –color=auto ospina,
  2. User “ospina” is probably running a few more things than just his R command (like ssh-ing into the server and maybe a couple of screens)
  3. We get a number than looks far lower than what we expected just by visual inspection.

So… what happened? We can fix problems 1 and 2 by just piping to a further grep command. But problem 3 is different. As it turns out, our culprit is running multiple threads from the same process (which is also why you find so many chrome instances on htop for example). We just counted processes, when really the server is being occupied by his multi-threading exploits. So… if you want to back up your complaint with a nice number, here’s your baby:

ps -ALf | grep opsina | grep R-3.3 | wc -l

The -L flag displays all threads instead of only the processes. I further used R-3.3 as it turns out he is using a specific version of R, which I can use to specify this command. Otherwise it also helps to use inputs arguments to functions to search against. If your fingers get too tired to press the shift-key that often, ps -ALf is equivalent to ps -eLf.

For now: moan away, folks!

 

Disclaimer: Any scenarios alluded to in the above text are fictitious and do not represent the behaviour of the individuals mentioned. Read: obviously I do not do my analysis last-minute.