Using Random Forests in Python with Scikit-Learn | Oxford Protein Informatics Group

I spend a lot of time experimenting with machine learning tools in my research; in particular I seem to spend a lot of time chasing data into random forests and watching the other side to see what comes out. In my many hours of Googling “random forest foobar” a disproportionate number of hits offer solutions implemented in R. As a young Pythonista in the present year I find this a thoroughly unacceptable state of affairs, so I decided to write a crash course in how to build random forest models in Python using the machine learning library scikit-learn (or sklearn to friends). This is far from exhaustive, and I won’t be delving into the machinery of how and why we might want to use a random forest. Rather, the hope is that this will be useful to anyone looking for a hands-on introduction to random forests (or machine learning in general) in Python.

In the future I’ll write a more in-depth post on how a few libraries turn Python into a powerful environment for data handling and machine learning. Until then, though, let’s jump into random forests!

Toy datasets

Sklearn comes with several nicely formatted real-world toy data sets which we can use to experiment with the tools at our disposal. We’ll be using the venerable iris dataset for classification and the Boston housing set for regression. Sklearn comes with a nice selection of data sets and tools for generating synthetic data, all of which are well-documented. Now, let’s write some Python!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
iris = datasets.load_iris()

Classification using random forests

First we’ll look at how to do solve a simple classification problem using a random forest. The iris dataset is probably the most widely-used example for this problem and nicely illustrates the problem of classification when some classes are not linearly separable from the others.

First we’ll load the iris dataset into a pandas dataframe. Pandas is a nifty Python library which provides a data structure comparable to the dataframes found in R with database style querying. As an added bonus, the seaborn visualization library integrates nicely with pandas allowing us to generate a nice scatter matrix of our data with minimal fuss.

df = pd.DataFrame(iris.data, columns=iris.feature_names)

# sklearn provides the iris species as integer values since this is required for classification
# here we're just adding a column with the species names to the dataframe for visualisation
df['species'] = np.array([iris.target_names[i] for i in iris.target])

sns.pairplot(df, hue='species')

Neat. Notice that iris-setosa is easily identifiable by petal length and petal width, while the other two species are much more difficult to distinguish. We could do all sorts of pre-processing and exploratory analysis at this stage, but since this is such a simple dataset let’s just fire on. We’ll do a bit of pre-processing later when we come to the Boston data set.

First, let’s split the data into training and test sets. We’ll used stratified sampling by iris class to ensure both the training and test sets contain a balanced number of representatives of each of the three classes. Sklearn requires that all features and targets be numeric, so the three classes are represented as integers (0, 1, 2). Here we’re doing a simple 50/50 split because the data are so nicely behaved. Typically however we might use a 75/25 or even 80/20 training/test split to ensure we have enough training data. In true Python style this is a one-liner.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)

Now let’s fit a random forest classifier to our training set. For the most part we’ll use the default settings since they’re quite robust. One exception is the out-of-bag estimate: by default an out-of-bag error estimate is not computed, so we need to tell the classifier object that we want this.

If you’re used to the R implementation, or you ever find yourself having to compare results using the two, be aware that some parameter names and default settings are different between the two. Fortunately both have excellent documentation so it’s easy to ensure you’re using the right parameters if you ever need to compare models.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)

Let’s see how well our model performs when classifying our unseen test data. For a random forest classifier, the out-of-bag score computed by sklearn is an estimate of the classification accuracy we might expect to observe on new data. We’ll compare this to the actual score obtained on our test data.

from sklearn.metrics import accuracy_score

predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)

print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')

Out-of-bag score estimate: 0.973
Mean accuracy score: 0.933

Not bad. However, this doesn’t really tell us anything about where we’re doing well. A useful technique for visualising performance is the confusion matrix. This is simply a matrix whose diagonal values are true positive counts, while off-diagonal values are false positive and false negative counts for each class against the other.

from sklearn.metrics import confusion_matrix

cm = pd.DataFrame(confusion_matrix(y_test, predicted), columns=iris.target_names, index=iris.target_names)
sns.heatmap(cm, annot=True)

This lets us know that our model correctly separates the setosa examples, but exhibits a small amount of confusion when attempting to distinguish between versicolor and virginica.

Random forest regression

Now let’s look at using a random forest to solve a regression problem. The Boston housing data set consists of census housing price data in the region of Boston, Massachusetts, together with a series of values quantifying various properties of the local area such as crime rate, air pollution, and student-teacher ratio in schools. The question for us is whether we can use these data to accurately predict median house prices. One caveat of this data set is that the median house price is truncated at $50,000 which suggests that there may be considerable noise in this region of the data. You might want to remove all data with a median house price of $50,000 from the set and see if the regression improves at all.

As before we’ll load the data into a pandas dataframe. This time, however, we’re going to do some pre-processing of our data by independently transforming each feature to have zero mean and unit variance. The values of different features vary greatly in order of magnitude. If we were to analyse the raw data as-is, we run the risk of our analysis being skewed by certain features dominating the variance. This isn’t strictly necessary for a random forest, but will enable us to perform a more meaningful principal component analysis later. Performing this transformation in sklearn is super simple using the StandardScaler class of the preprocessing module. This time we’re going to use an 80/20 split of our data. You could bin the house prices to perform stratified sampling, but we won’t worry about that for now.

boston = datasets.load_boston()

features = pd.DataFrame(boston.data, columns=boston.feature_names)
targets = boston.target

As before, we’ve loaded our data into a pandas dataframe. Notice how I have to construct new dataframes from the transformed data. This is because sklearn is built around numpy arrays. While it’s possible to return a view of a dataframe as an array, transforming the contents of a dataframe requires a little more work. Of course, there’s a library for that, but I’m lazy so I didn’t use it this time.

from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(scaler.transform(X_train), index=X_train.index.values, columns=X_train.columns.values)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), index=X_test.index.values, columns=X_test.columns.values)

With the data standardised, let’s do a quick principal-component analysis to see if we could reduce the dimensionality of the problem. This is quick and easy in sklearn using the PCA class of the decomposition module.

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_train)
cpts = pd.DataFrame(pca.transform(X_train))
x_axis = np.arange(1, pca.n_components_+1)
pca_scaled = PCA()
pca_scaled.fit(X_train_scaled)
cpts_scaled = pd.DataFrame(pca.transform(X_train_scaled))

# matplotlib boilerplate goes here

Notice how without data standardisation the variance is completely dominated by the first principal component. With standardisation, however, we see that in fact we must consider multiple features in order to explain a significant proportion of the variance. You might want to experiment with building regression models using the principal components (or indeed just combinations of the raw features) to see how well you can do with less information. For now though we’re going to use all of the (scaled) features as the regressors for our model. As with the classification problem fitting the random forest is simple using the RandomForestRegressor class.

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

Now let’s see how we do on our test set. As before we’ll compare the out-of-bag estimate (this time it’s an R-squared score) to the R-squared score for our predictions. We’ll also compute Spearman rank and Pearson correlation coefficients for our predictions to get a feel for how we’re doing.

from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr

predicted_train = rf.predict(X_train)
predicted_test = rf.predict(X_test)

test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)

print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print(f'Test data R-2 score: {test_score:>5.3}')
print(f'Test data Spearman correlation: {spearman[0]:.3}')
print(f'Test data Pearson correlation: {pearson[0]:.3}')

Out-of-bag R-2 score estimate: 0.841
Test data R-2 score: 0.886
Test data Spearman correlation: 0.904
Test data Pearson correlation: 0.942

Not too bad, though there are a few outliers that would be worth looking into. Your challenge, should you choose to accept it, is to see if removing the $50,000 data improves the regression.

Wrapping up

Congratulations on making it this far. Now you know how to pre-process your data and build random forest models all from the comfort of your iPython session. I plan on writing more in the future about how to use Python for machine learning, and in particular how to make use of some of the powerful tools available in sklearn (a pipeline for data preparation, model fitting, prediction, in one line of Python? Yes please!), and how to make sklearn and pandas play nicely with minimal hassle. If you’re lucky, and if I can bring myself to process the data nicely, I might include some fun examples from less well-behaved real-world data sets.

Until then, though, happy Pythoning!

Author

Fergus Boyles

View all posts