Monthly Archives: April 2017

Bitbucket and PyCharm – Tools to make a DPhil less problematic

I find Git a wonderful tool for my work, with version control providing much-needed damage control for my projects. I also find Git incredibly powerful at making my working life easier, as git push and git pull let me synchronise my code between the various computers that I use for my DPhil. Via a Bitbucket account providing a remote Git repository, I can move my code to wherever I am working, leaving more room for either procrastination or staring at my screen in confusion.

As simple as Git is, it can be a fiddle to enter the commands on the command line, and to remember to do so as you rush to leave the building. This has all been made much easier with PyCharm, from JetBrains. This IDE (integrated development environment) offers many tools in addition to version control, such as support for a variety of file types, PEP8 checks to ensure good-quality code, and the ability to work with IPython notebooks.

I’ve put the following mini-tutorial together for those who want to create a repository in PyCharm, or bring in an existing one, and get version control working:

Continue reading

Colour page counter

So you’ve written the thesis, you’ve been examined, the corrections are done, and now you are left with just wearing the silly clown robes to get a piece of paper with your name on it. However, you’ve been informed that you aren’t allowed to don the silly robes until you print the damn thing (again) and submit it to the Bod to be ignored for generations to come. Oh, and the added bonus is that you have to pay for it. Naturally, you want the high-quality printing and paper to match for the final versions, but it’s all so expensive. At least you can save a few meagre pounds by specifying only the pages for colour printing. Naturally, I decided that I would spend far more time making a script than just counting them myself (which I did anyway to verify it works). Enjoy.


#!/usr/bin/env Rscript
# Usage: Rscript this_script.R your_thesis.pdf
# Counts colour pages using Ghostscript's "inkcov" device, which reports the
# CMYK ink coverage of every page in the PDF.

library(data.table)

args <- commandArgs(trailingOnly=TRUE)
# Run Ghostscript on the PDF given as the first argument and keep only the
# cyan, magenta and yellow coverage values (columns 1-3) for each page
x <- system(paste("gs -q -o - -sDEVICE=inkcov",args[1]," | awk '{print $1,$2,$3}'"),intern=TRUE)
x <- as.data.table(tstrsplit(x,' '))
x[,c("V1","V2","V3"):=.(as.numeric(V1),as.numeric(V2),as.numeric(V3))]
# A page needs colour printing if any of its C, M or Y coverage is non-zero
print(paste("Colour pages total:",sum(rowSums(x)!=0)))
print(paste("Colour pages:", paste(which(rowSums(x) != 0),collapse=', ')))

Faster FREAD with Pandas

One of the things I like to do is scale things up using the ridiculous number of cores at my disposal (sometimes even for a good reason). One such example was when I had to model millions of CDRs (or loops) using FREAD.

The process through which you model a loop in FREAD is:

  1. Pre-filtering step: the anchor Cα separation and the ESST score are computed between your target and every template in the DB. The templates that pass a threshold are saved for step 2
  2. Anchor RMSD test

The major bottleneck of this analysis is step 1, where most of the templates are filtered out so that step 2 only sees a much reduced subset. The data needed for the anchor Cα separation and the ESST score is stored, for each possible template, in one row of an SQLite database. So when you do step 1 you go through each row of this table and calculate the score, and since the database lives on the hard drive this means costly I/O. This is fine for the original purpose of FREAD, where you fill in a missing loop for one structure, but when you are doing it for 100 million examples, going through a table stored on a hard drive 100 million times, sequentially, is going to be SLOW. I say sequentially because, for the Python implementation using sqlite3, I had a lot of trouble trying to use a DB handle on multiple threads, or to load the same SQL file into different instances on separate threads; it just crashes for no good reason. There has been chat about this on Stack Overflow, and I think things have moved on since I implemented this in 2015. Nevertheless, I wanted a simple and clean solution.

I decided to transform the sqlite3 database into a Pandas object. Pandas objects are basically a convenient way of storing tables, with methods available that mimic conventional database querying. They are held in memory, easily dumped as pickle files, and can be easily duplicated between threads, so there are no issues with thread safety. Obviously you need to have enough memory to store all of that, but for my application that was not a problem. Below is some sample code on how I used it to transform the template DB from FREAD.

import pandas as pd
import sqlite3 as sql

rows = []

# connect to your fread sql file
conn = sql.connect("fread_sql_file.sql")
try:
    query = "SELECT dihedral, sequence, pdbcode, start, anchor, bound FROM loops"
    results = conn.execute(query)
    for row in results:
        # store each row as a dictionary
        rows.append({'dihedral': row[0],
                     'sequence': row[1],
                     'length': len(row[1]),
                     'pdbcode': row[2],
                     'start': row[3],
                     'anchor': row[4],
                     'bound': row[5]})
except Exception as e:
    print("Error during query", str(e))
finally:
    conn.close()

# create a pandas dataframe from the list of dictionaries
df = pd.DataFrame(rows)
# store the table as a pickle file which you can reload later (this is very fast!)
df.to_pickle("fread_pandas_file.pickle")

After running this you will have your SQL database as a pandas dataframe, and you can write thread-safe methods to model loops, as below:

import pandas as pd

THRESHOLD = 25
# load the pickled template database (read-only, so it can be shared safely between threads)
cdr_db = pd.read_pickle("fread_pandas_file.pickle")


def model_loop(query_sequence, query_anchors_ca):
    # score_sequence_db_helper is your function that computes a score based on
    # your query sequence/anchors and a row of the template db
    scores = cdr_db.apply(lambda row: score_sequence_db_helper(row, query_sequence, query_anchors_ca), axis=1)

    # attach the score to each template
    results = zip(list(cdr_db['pdbcode']), scores, list(cdr_db['sequence']))

    # keep the templates that score at or above the threshold
    results = [(pdb_code, score, sequence) for (pdb_code, score, sequence) in results if score >= THRESHOLD]

    return results

A very basic introduction to Random Forests using R

Random Forests is a powerful tool used extensively across a multitude of fields. As a matter of fact, it is hard to come across a data scientist who has never had to resort to this technique at some point. Motivated by the fact that I have been using Random Forests quite a lot recently, I decided to give a quick intro to Random Forests using R.

So what are Random Forests? Well, I am probably not the most suited person to answer this question (a Google search will reveal much more interesting answers), still I shall give it a go. Random Forests is a learning method for classification (and other applications; see below). It is based on generating a large number of decision trees, each constructed using a different subset of your training set. These subsets are usually selected by sampling at random and with replacement from the original data set. The decision trees are then used to identify a classification consensus by selecting the most common output (the mode). While random forests can be used for other applications (e.g. regression), for the sake of keeping this post short, I shall focus solely on classification.
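To make that idea concrete, here is a toy sketch of the bagging-plus-voting part (this is only an illustration, assuming the rpart package is installed; it is not how the randomForest package is implemented, and it leaves out the random feature selection at each split). It grows a handful of decision trees on bootstrap samples of iris and takes the most common prediction:

# Toy illustration of bagging + majority vote (assumes the rpart package)
library(rpart)
set.seed(1)
n_trees <- 25
trees <- lapply(1:n_trees, function(i) {
  # draw a bootstrap sample: same size as the data, sampled with replacement
  boot_indexes <- sample(1:nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[boot_indexes, ])
})
# every tree votes on a class for each observation
votes <- sapply(trees, function(tree) as.character(predict(tree, iris, type = "class")))
# the consensus is the most common vote (the mode) per observation
consensus <- apply(votes, 1, function(v) names(which.max(table(v))))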

Why R? Well, the quick and easy answer is that I do all my plotting in R (mostly because I think ggplot2 looks very pretty). I decided to explore Random Forests in R and to assess the advantages and shortcomings of this implementation. I am planning to compare Random Forests in R against the python implementation in scikit-learn. Do expect a post about this in the near future!

The data: to keep things simple, I decided to use Edgar Anderson’s Iris data set. You can have a look at it by inspecting the contents of iris in R. This data set contains observations of four features (sepal length and width, and petal length and width, all in cm) for 150 flowers, equally split between three different iris species. This data set is fairly canonical in classification and data analysis. Let us take a look at it, shall we:
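One quick way to eyeball the data is a simple base-R pairs plot of the four features, coloured by species (just one option among many for this kind of overview):

# Pairs plot of the four features, coloured by species
pairs(iris[, 1:4], col = iris$Species, pch = 19)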

As you can observe, there seems to be some separation with regard to the different features and our three species of irises [note: this set is not very representative of a real-world data set and results should be taken with a grain of salt].

Training and Validation sets: great care needs to be taken to ensure clear separation between training and validation sets. I tend to save the cases for which I am actually interested in performing predictions as a second validation set (Validation 2). Then I split the remaining data evenly into Training and Validation 1.

Let us split our data set then, shall we?

# Set random seed to make results reproducible:
set.seed(17)
# Calculate the size of each of the data sets:
data_set_size <- floor(nrow(iris)/2)
# Generate a random sample of "data_set_size" indexes
indexes <- sample(1:nrow(iris), size = data_set_size)

# Assign the data to the correct sets
training <- iris[indexes,]
validation1 <- iris[-indexes,]

Before we can move on, here are some things to consider:

1- The size of your data set usually imposes a hard limit on how many features you can consider. This occurs due to the curse of dimensionality, i.e. your data becomes sparser and sparser as you increase the number of features considered, which usually leads to overfitting. While there is no strict rule of thumb for how many features you can use for a given number of observations, I try to keep e^Nf < No (Nf = number of features, No = number of observations) to minimise overfitting [this is not always possible and it does not ensure that we won’t overfit]. In this case, our training set has 75 observations, which suggests that using four features (e^4 ~ 54.6) is not entirely absurd. Obviously, this depends on your data, so we will cover some further overfitting checks later on.

2- An important thing to consider when assembling training sets is the proportion of negatives vs. positives in your data. Think of an extreme scenario where you have many, many more observations for one class than for the others. How will this affect classification? It would make it more likely for the classifier to predict the dominant class when given new values. I mentioned before that the iris set is quite nice to play with. It comes with exactly 50 observations for each species of iris. What happens if you have a data set with a much higher number of observations for a particular class? You can bypass any imbalance in the representation of each class by carefully constructing your training set so as not to favour any particular class. In this case, our randomly selected set has 21 observations for species setosa and 27 observations for each of species versicolor and virginica, so we are good to go (a quick way to check this is shown in the code sketch after this list).

3- Another common occurrence that is not represented by the iris data set is missing values (NAs) for some observations. There are many ways of dealing with missing values, including assigning the median or the mode of that particular feature to the missing observation, or even disregarding some observations entirely, depending on how many observations you have. There are even ways to use random forests themselves to estimate a good value to assign to the missing observations, but for the sake of brevity, this will not be covered here (a very simple median/mode imputation is included in the sketch after this list).
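To illustrate points 2 and 3, here is a minimal sketch, assuming the randomForest package is installed (its na.roughfix function fills NAs with the column median for numeric features and the mode for factors):

# Point 2: check the class balance of our training set
table(training$Species)

# Point 3: if the data contained NAs, one simple option is median/mode imputation
# (iris has no NAs, so this changes nothing here)
training_imputed <- randomForest::na.roughfix(training)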

Right, data sets prepared and no missing values, it is time to fire up our random forests algorithm. I am using the randomForest package; you can follow the link for additional documentation. Here is the example usage code:

#import the package
library(randomForest)
# Perform training:
rf_classifier = randomForest(Species ~ ., data=training, ntree=100, mtry=2, importance=TRUE)

Note some important parameters:

- The first parameter specifies our formula: Species ~ . (we want to predict Species using each of the remaining columns of data).
- ntree defines the number of trees to be generated. It is typical to test a range of values for this parameter (e.g. 100, 200, 300, 400, 500) and choose the one that minimises the OOB estimate of error rate.
- mtry is the number of features used in the construction of each tree. These features are selected at random, which is where the “random” in “random forests” comes from. The default value for this parameter, when performing classification, is sqrt(number of features).
- importance enables the algorithm to calculate variable importance.

We can quickly look at the results of our classifier for our training set by printing the contents of rf_classifier:

> rf_classifier

Call:
 randomForest(formula = Species ~ ., data = training,ntree=100,mtry=2, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         21          0         0  0.00000000
versicolor      0         25         2  0.07407407
virginica       0          2        25  0.07407407


As you can see, it lists the call used to build the classifier, the number of trees (100), the number of variables tried at each split (2), and it outputs a very useful confusion matrix and the OOB estimate of error rate. This estimate is calculated by counting how many points in the training set were misclassified (2 versicolor and 2 virginica observations = 4) and dividing this number by the total number of training observations (4/75 ~= 5.33%).

The OOB estimate of error rate is a useful measure to discriminate between different random forest classifiers. We could, for instance, vary the number of trees or the number of variables to be considered, and select the combination that produces the smallest value for this error rate. For more complicated data sets, i.e. when a higher number of features is present, a good idea is to use cross-validation to perform feature selection using the OOB error rate (see rfcv from randomForest for more details).
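As a concrete example, here is a minimal (and by no means exhaustive) sweep over a few ntree and mtry values that keeps the combination with the lowest OOB error; for a fitted classifier, the OOB error rate after all trees have been grown sits in the last row of its err.rate matrix:

# Try a small grid of ntree and mtry values and record each OOB error rate
set.seed(17)
oob_results <- data.frame()
for (ntree in c(100, 200, 300, 400, 500)) {
  for (mtry in c(1, 2, 3, 4)) {
    fit <- randomForest(Species ~ ., data = training, ntree = ntree, mtry = mtry)
    oob_results <- rbind(oob_results,
                         data.frame(ntree = ntree, mtry = mtry,
                                    oob_error = fit$err.rate[ntree, "OOB"]))
  }
}
# The combination with the smallest OOB estimate of error rate:
oob_results[which.min(oob_results$oob_error), ]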

Remember the importance parameter? Let us take a look at the importance that our classifier has assigned to each variable:

varImpPlot(rf_classifier)

Each feature’s importance is assessed based on two criteria:

- MeanDecreaseAccuracy: gives a rough estimate of the loss in prediction performance when that particular variable is omitted from the training set. Caveat: if two variables are somewhat redundant, then omitting one of them may not lead to massive losses in prediction performance, but it would make the second variable more important.

- MeanDecreaseGini: GINI is a measure of node impurity. Think of it like this: if you use this feature to split the data, how pure will the nodes be? The highest purity means that each node contains only elements of a single class. Assessing the decrease in GINI when that feature is omitted leads to an understanding of how important that feature is for splitting the data correctly.

Do note that these measures are used to rank variables in terms of importance and, thus, their absolute values could be disregarded.
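If you prefer the raw numbers behind the plot, the same information can be printed as a table (the per-class columns appear because we trained with importance=TRUE):

# Numeric importance measures for each feature:
# per-class columns plus MeanDecreaseAccuracy and MeanDecreaseGini
importance(rf_classifier)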

Ok, great. Looks like we have a classifier that was properly trained and is producing reasonably good predictions for our training set. Shall we evaluate what happens when we try to use this classifier to predict classes for our validation1 set?

# Validation set assessment #1: looking at confusion matrix
prediction_for_table <- predict(rf_classifier,validation1[,-5])
table(observed=validation1[,5],predicted=prediction_for_table)

            predicted
observed     setosa versicolor virginica
  setosa         29          0         0
  versicolor      0         20         3
  virginica       0          1        22

The confusion matrix is a good way of looking at how well our classifier performs when presented with new data.
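If you want a single summary number, the overall accuracy on validation1 can be read straight off this table, since correct predictions sit on the diagonal:

# Overall accuracy: correctly classified / total observations in validation1
confusion_matrix <- table(observed=validation1[,5], predicted=prediction_for_table)
sum(diag(confusion_matrix)) / sum(confusion_matrix)  # (29+20+22)/75 ~= 0.95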

Another way of assessing the performance of our classifier is to generate a ROC curve and compute the area under the curve:


# Validation set assessment #2: ROC curves and AUC

# We need the ROCR package for ROC curve plotting:
library(ROCR)

# Calculate the probability of new observations belonging to each class
# prediction_for_roc_curve will be a matrix with dimensions data_set_size x number_of_classes
prediction_for_roc_curve <- predict(rf_classifier,validation1[,-5],type="prob")

# Use pretty colours:
pretty_colours <- c("#F8766D","#00BA38","#619CFF")
# Specify the different classes 
classes <- levels(validation1$Species)
# For each class
for (i in 1:3)
{
 # Define which observations belong to classes[i]
 true_values <- ifelse(validation1[,5]==classes[i],1,0)
 # Assess the performance of the classifier for classes[i]
 pred <- prediction(prediction_for_roc_curve[,i],true_values)
 perf <- performance(pred, "tpr", "fpr")
 if (i==1)
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i]) 
 }
 else
 {
     plot(perf,main="ROC Curve",col=pretty_colours[i],add=TRUE) 
 }
 # Calculate the AUC and print it to screen
 auc.perf <- performance(pred, measure = "auc")
 print(auc.perf@y.values)
}

Here is the final product (ROC curve):

And here are the values for our AUCs:

- Setosa: AUC = 1
- Versicolor: AUC = 0.98
- Virginica: AUC = 0.98

Voila! I hope this was somewhat useful!