R or Python for data vis? | Oxford Protein Informatics Group

Python users: ever wanted to learn R?
R users: ever wanted to learn Python?
Check out: http://mathesaurus.sourceforge.net/r-numpy.html

Both languages are incredibly powerful for large-scale data analysis. Both also have excellent data visualisation libraries that let you make custom graphs very easily (e.g. with your own choice of fonts, colour palettes, etc.). Here is a quick run-down of the good, the bad, and the ugly of each:

R

The good:
- More established in statistical analyses; if you can’t find an R package for something, chances are it won’t be available in Python either.
- Data frame parsing is fast and efficient, and incredibly easy to use (e.g. indexing specific rows, which is surprisingly hard in pandas).
- If GUIs are your thing, there are programs like RStudio that combine the console, plotting, and code editor in one place.
The bad:
- For loops are traditionally slow, meaning that you have to use lots of apply commands (e.g. tapply, sapply).
The ugly:
- Help documentation can be challenging to read and follow, leading to (potentially) a steep learning curve.
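For comparison with the row indexing point in the list above, here is a minimal sketch of the pandas counterparts to R's `df[c(1, 3), ]` and `df[df$length > 10, ]`. The data frame, its column names, and its row labels are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data frame; the columns and row labels are illustrative only
df = pd.DataFrame(
    {"chain": ["H", "L", "H", "L"], "length": [12, 9, 15, 11]},
    index=["ab1", "ab2", "ab3", "ab4"],
)

# Select rows by label with .loc
by_label = df.loc[["ab1", "ab3"]]

# Select the same rows by integer position with .iloc (0-based, unlike R)
by_position = df.iloc[[0, 2]]

# Select rows that satisfy a condition using a boolean mask
by_condition = df[df["length"] > 10]

print(by_label)
print(by_condition)
```

The split between label-based `.loc` and position-based `.iloc` is the part that tends to trip up people coming from R, where a single `[rows, cols]` syntax covers both.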

Python

The good:
- If you have existing code in Python (e.g. analysing protein sequences/structures), then you can plot the results straight away, without first having to save them as a separate CSV file for analysis elsewhere.
- Lots of well-supported packages such as NumPy, SciPy, scikit-learn, etc., with good documentation and lots of help on forums (e.g. Stack Overflow).
- It’s more useful for string manipulation (e.g. parsing out the ordering of IMGT numbering for antibodies, which goes from 111A->111B->112B->112A->112)
The bad:
- Matplotlib, which is the go-to for data visualisation, has a pretty steep learning curve.
The ugly:
- For statistical analyses, model building can have an unusual syntax. For example, building a linear model in R is incredibly easy (lm), whereas Python involves sklearn.linear_model.LinearRegression().fit. Otherwise you have to code up a lot of things yourself, which might not be practical.
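To make the syntax comparison in the last point concrete, here is a minimal sketch of a simple linear fit with scikit-learn, roughly what `lm(y ~ x)` does in R. The data are hypothetical and noise-free (y = 2x + 1), chosen only so the fitted values are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data following y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features),
# so a single predictor has to be reshaped into one column
model = LinearRegression().fit(x.reshape(-1, 1), y)

slope = model.coef_[0]
intercept = model.intercept_
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Note that, unlike R's `summary(lm(...))`, the fitted scikit-learn model exposes the coefficients but no table of standard errors or p-values.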

For me, Python wins because I find it’s much easier to create an analysis pipeline where you can go from raw data (e.g. PDB structures) to analysing it (e.g. with BioPython) then plotting custom graphics. Another big selling point is that Python packages have great documentation. Of course, there are libraries to do the analyses in R but the level of freedom, I find, is a bit more restricted, and R’s documentation means you’re often stuck interpreting what the package vignette is saying, rather than doing actual coding.
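As an illustration of the string manipulation mentioned in the Python list above, here is a minimal sketch of sorting IMGT position labels with a custom key. It assumes only the ordering described there (insertions at 111 ascend, insertions at 112 descend and come before 112 itself) and single-letter insertion codes; the function name `imgt_sort_key` and the example labels are hypothetical.

```python
import re

def imgt_sort_key(position):
    """Return a sort key for an IMGT position label such as '111A' or '112'."""
    match = re.fullmatch(r"(\d+)([A-Z]?)", position)
    if match is None:
        raise ValueError(f"Unrecognised position label: {position!r}")
    number = int(match.group(1))
    letter = match.group(2)
    # No insertion code ranks 0; 'A' ranks 1, 'B' ranks 2, and so on
    rank = ord(letter) - ord("A") + 1 if letter else 0
    # Assumption from the text: at position 112 the insertions run in
    # reverse (112B, 112A, 112), so the rank is negated there
    if number == 112:
        rank = -rank
    return (number, rank)

positions = ["112", "111B", "110", "112A", "111", "113", "112B", "111A"]
ordered = sorted(positions, key=imgt_sort_key)
print(ordered)
# ['110', '111', '111A', '111B', '112B', '112A', '112', '113']
```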

As for plotting (because pretty graphs are where it’s at!), here’s a very simple implementation of plotting the densities of two normal distributions, along with their means and standard deviations.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# plt.style.use('xkcd') # A cool feature of matplotlib is stylesheets, e.g. make your plots look XKCD-like

# change font to Arial
# you can change this to any TrueType font that you have on your machine
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['Arial']

# Generate two sets of numbers from a normal distribution
# one with mean = 4 sd = 0.5, another with mean (loc) = 1 and sd (scale) = 2
randomSet = np.random.normal(loc = 4, scale = 0.5, size = 1000)
anotherRandom = np.random.normal(loc = 1, scale = 2, size = 1000)

# Define a Figure and Axes object using plt.subplots
# Axes object is where we do the actual plotting (i.e. draw the histogram)
# Figure object is used to configure the actual figure (e.g. the dimensions of the figure)
fig, ax = plt.subplots()

# Plot a histogram with custom-defined bins, with a blue colour, transparency of 0.4
# Plot the density rather than the raw count using density = True
# (the older normed = True argument has been removed from recent matplotlib versions)
ax.hist(randomSet, bins = np.arange(-3, 6, 0.5), color = '#134a8e', alpha = 0.4, density = True)
ax.hist(anotherRandom, bins = np.arange(-3, 6, 0.5), color = '#e8291c', alpha = 0.4, density = True)

# Plot solid lines for the means
ax.axvline(np.mean(randomSet), color = 'blue')
ax.axvline(np.mean(anotherRandom), color = 'red')

# Plot dashed lines one std dev either side of each mean
ax.axvline(np.mean(randomSet) - np.std(randomSet), linestyle = '--', color = 'blue')
ax.axvline(np.mean(randomSet) + np.std(randomSet), linestyle = '--', color = 'blue')

ax.axvline(np.mean(anotherRandom) - np.std(anotherRandom), linestyle = '--', color = 'red')
ax.axvline(np.mean(anotherRandom) + np.std(anotherRandom), linestyle = '--', color = 'red')

# Set the title, x- and y-axis labels
ax.set_title('A fancy plot')
ax.set_xlabel("Value of $x$")
ax.set_ylabel("Density")

# Set the Figure's size as a 5in x 5in figure
fig.set_size_inches((5,5))

# Display the figure (needed when running as a script rather than in a notebook)
plt.show()

Figure made by matplotlib using the code above.
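To show what plotting the density rather than the raw count means, here is a small sketch using `np.histogram`, which performs the same binning without drawing anything. The seeded generator and the variable names are assumptions made for reproducibility and are not part of the plotting code above.

```python
import numpy as np

# Seeded generator so the numbers are reproducible (an illustrative choice)
rng = np.random.default_rng(0)
sample = rng.normal(loc=4, scale=0.5, size=1000)
bins = np.arange(-3, 6, 0.5)

# Raw counts per bin, and the same bins normalised to a density
counts, edges = np.histogram(sample, bins=bins)
density, _ = np.histogram(sample, bins=bins, density=True)

# With density=True, each bar's height is count / (total count * bin width),
# so the bar areas (height * width) sum to 1
widths = np.diff(edges)
total_area = np.sum(density * widths)
print(total_area)
```

Values that fall outside the bin range are simply not counted, so the normalisation is over the binned values only.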

randomSet = rnorm(mean = 4, sd = 0.5, n = 1000)
anotherRandom = rnorm(mean = 1, sd = 2, n = 1000)

# Let's define a range for binning the histograms,
# padded by 10% on either side so that every value falls inside a bin
limits = range(randomSet, anotherRandom)
lbound = limits[1] - (diff(limits) * 0.1)
ubound = limits[2] + (diff(limits) * 0.1)
# use freq = FALSE to plot density
# in breaks, we define the bins of the histogram by providing a vector of values using seq
# xlab, ylab define axis labels; main sets the title
# rgb defines the colour in RGB values from 0-1, with the fourth value setting transparency
# e.g. rgb(0,1,0,1) is R = 0, G = 1, B = 0, with an alpha of 1 (i.e. not transparent)
hist(randomSet, freq = FALSE, breaks = seq(lbound, ubound, 0.5), col = rgb(0,0,1,0.4), xlab = 'Value of x', ylab = 'Density', main = 'A fancy plot')
# Use add = TRUE to keep both histograms in one graph
# other parameters, such as breaks, etc., can be introduced here
hist(anotherRandom, freq = FALSE, breaks = seq(lbound, ubound, 0.5), col = rgb(1,0,0,0.4), add = TRUE)

# Plot vertical lines with v =
# lty = 2 generates a dashed line
abline(v = c(mean(randomSet), mean(anotherRandom)), col = c('blue', 'red'))

abline(v = c(mean(randomSet)-sd(randomSet), mean(randomSet)+sd(randomSet)), col = 'blue', lty = 2)
abline(v = c(mean(anotherRandom)-sd(anotherRandom), mean(anotherRandom)+sd(anotherRandom)), col = 'red', lty = 2)

Similar figure made using R code from above.

*Special thanks go out to Ali and Lyuba for helpful fixes to make the R code more efficient!

Author

Jinwoo Leem
