Author Archives: Jinwoo Leem

I just wanted TensorFlow

Finally got TensorFlow to install on my Mac. You’d be tempted to think, “Jin, it’s just a pip install, surely?”

No, macOS begs to differ! You see, if you're on a slightly older macOS version like I was (10.12), your system's OpenSSL is old enough that pip talks to the outside world over TLS 1.0 – long story short, PyPI now rejects any requests made over TLS 1.0, and this cutoff kicked in only something like a week ago – SAD! If you have macOS 10.13 onward, TLS should already be set to 1.2, so you need not worry.
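
As a quick sanity check – just a sketch, and assuming you run it with the same Python that pip uses – you can ask the ssl module which OpenSSL your interpreter was built against; anything older than OpenSSL 1.0.1 cannot speak TLS 1.2:

# Run this with the same Python that pip uses; OpenSSL versions older than
# 1.0.1 (e.g. the 0.9.8 series shipped with older macOS) cannot speak TLS 1.2
import ssl
print(ssl.OPENSSL_VERSION)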

TL;DR:

  1. Get a new version of pip (10.0); see Stack Overflow post.
  2. Install any dependencies for pip as necessary by doing tons of source compilations.
  3. Install desired package(s) as necessary.

TCR Database

Back-to-back posting – I wanted to talk about the growing volume of TCR structures in the PDB. A couple of weeks ago, I presented my database, STCRDab, to the group; it's now available at http://opig.stats.ox.ac.uk/webapps/stcrdab.

Unlike other databases, STCRDab is fully automated and updates every Friday at 9AM (GMT), downloading new TCR structures and annotating them with the IMGT numbering (this applies to MHCs too!). Although the number of TCR structures is significantly smaller than, say, the number of antibody structures (currently 3000+ and growing), the recent approval of CAR-T therapies (Kymriah, Yescarta) and the rising interest in TCR engineering (e.g. Glanville et al., Nature, 2017; Dash et al., Nature, 2017) point toward the value of these structures.

Feel free to read more in the paper, and here are some screenshots. 🙂

STCRDab front page.

Look! 5men, literally.

Possibly my new favourite PDB code.

STCRDab annotates structures automatically every Friday!

ABodyBuilder and model quality

Currently I’m working on developing a new strategy to use FREAD within the ABodyBuilder pipeline. While running some tests, I realised that there were some minor miscalculations of the CDR loops’ RMSD in my paper.

To start with, the main message of the paper remains the same; the overall quality of the models (Fv RMSD) was correct, and still is. ABodyBuilder isn’t necessarily the most accurate modelling methodology per se, but it’s unique in its ability to estimate the RMSD of its own models, and it remains capable of doing this regardless of what the CDR loops’ RMSD may be. This is because the accuracy estimation looks at the observed RMSD data and assigns a probability that a new model structure will have some RMSD value “x” (given the CDR loop’s length). Our website has now been updated in light of these changes too.
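
To make the idea concrete, here’s a toy sketch – emphatically not the actual ABodyBuilder procedure, and with made-up numbers – of how a set of observed RMSDs for loops of a given length can be turned into an estimated probability that a new model falls under some cutoff:

import numpy as np

# Toy RMSD values (in Angstroms) observed for models of loops of one length;
# these numbers are made up purely for illustration
observed_rmsds = np.array([0.8, 1.1, 1.3, 1.7, 2.0, 2.4, 3.1])

cutoff = 2.0
# Empirical estimate of P(RMSD <= cutoff) for a new model of the same length
prob = np.mean(observed_rmsds <= cutoff)
print("Estimated P(RMSD <= %.1f A): %.2f" % (cutoff, prob))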

Update to Figure 2 of the paper.

Update to Figure S4 of the paper.

Update to Figure S5 of the paper.

Typography in graphs

Typography [tʌɪˈpɒɡrəfi]
    n.: the style and appearance of printed matter.

Perhaps a glossed-over aspect of making graphs, having the right font goes a long way. Not only do we get to use a “pretty” font that we like, there is also the aesthetic satisfaction of having everything (e.g. in a PhD thesis) in the same font, i.e. both the text and the graphs use the same typeface.

Fonts can be divided into two types: serif and sans-serif. Basically, serif fonts are those where the letters have little “bits” at the ends of the strokes; think of Times New Roman or Garamond as the classic examples. Sans-serif fonts are those that lack these bits, giving text a more “blocky”, clean finish – think of Arial or Helvetica as classic examples.

Typically, serif fonts are better for books/printed materials, whereas sans-serif fonts are better for web/digital content. So what about graphs, especially those that may go out into the public domain (whether through publication or on a website)?

This largely boils down to personal preference, and choosing the right font is not trivial. Supposing you have downloaded a font you like (say, from Google Fonts), there are a few things to sort out (e.g. making sure that your TeX distribution and Illustrator can see the font). This post, however, is concerned with how we can use custom fonts in graphs generated by matplotlib, and why this is useful. My favourite picks include Roboto and Palatino.

The default font in matplotlib isn’t the prettiest (I think) for publication/keeping purposes, but I digress…

To start, let’s generate a histogram of 1000 random numbers from a normal distribution.
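
A minimal version of that, using nothing but matplotlib’s defaults (the output file name is just an example), might look like this:

import numpy as np
import matplotlib.pyplot as plt

# 1000 random numbers from a standard normal distribution
data = np.random.normal(size = 1000)

fig, ax = plt.subplots()
ax.hist(data, bins = 30)
ax.set_xlabel("Value")
ax.set_ylabel("Count")
fig.savefig("histogram_default_font.png", dpi = 300)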

The default font in matplotlib, Bitstream Vera Sans, isn’t the prettiest thing on earth. It does the job, but it isn’t my go-to choice if I can change it. Plus, with lots of journals asking for Type 1/TrueType fonts in figures, there’s even more reason to change it (matplotlib, by default, generates graphs using Type 3 fonts!). If we now change to Roboto or Palatino, we get the following:

Sans-serif Roboto.

Serif font Palatino.

Basically, the bits we need to include at the beginning of our code are here:

# Need to import matplotlib options setting method
# Set PDF font types - not necessary but useful for publications
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42

# For sans-serif
from matplotlib import rc
rc("font", **{"sans-serif": ["Roboto"]}

# For serif - matplotlib uses sans-serif family fonts by default
# To render serif fonts, you also need to tell matplotlib to use LaTeX in the backend.
rc("font", **{"family": "serif", "serif": ["Palatino"]})
rc("text", usetex = True)

This not only guarantees that images are generated using a font of our choice, but it gives a Type 1/TrueType font too. Ace!
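
One extra check worth doing – a quick sketch, assuming the font is actually installed on your system – is to ask matplotlib’s font manager whether it can see the font at all; if it can’t, matplotlib quietly falls back to the default (usually with a findfont warning):

from matplotlib import font_manager

# Names of all TrueType fonts that matplotlib's font manager knows about;
# if you installed the font recently, you may need to rebuild the font cache
available = sorted({f.name for f in font_manager.fontManager.ttflist})
print("Roboto" in available)
print("Palatino" in available)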

Happy plotting.

Using bare git repos

Git is a fantastic way to version-control your code. Whether it’s for sharing with collaborators or just for your own reference, it acts as an almost absolute point of reference for a wide variety of applications and needs. The basic concept of git is that you have your own folder (in which you edit your code, etc.) and you commit/push those changes to a git repository. Note that Git is a version control SYSTEM, and GitHub/BitBucket etc. are services that host repositories using Git as their backend!

The basic procedure of git can be summarised to:

1. Change/add/delete files in your current working directory as necessary. This is followed by a git add or git rm command.
2. “Commit” those changes; we usually put a message reflecting the change from step 1. e.g. git commit -m "I changed this file because it had a bug before."
3. You “push” those changes with git push to a git repository (e.g. hosted by BitBucket, GitHub, etc.); this publishes your commits so the remote copy of the repository is brought up to date.

Typically we use services like GitHub to HOST a repository. We then push our changes to that repository (or git pull from it) and all is good. However, a powerful concept to bear in mind is the ‘bare’ git repository. This is especially useful if you have code that’s private and should be strictly kept within your company/institution’s server, yet you don’t want people messing about too much with the master version of the code. The diagram below makes the bare git repository concept quite clear:

The bare repo acts as a “master” version of sorts, and every other “working”, or non-bare repo pushes/pulls changes out of it.

Let’s start with the easy stuff first. Every git repository (e.g. the one you’re working on in your machine) is a WORKING/NON-BARE git repository. This shows the files in your code as you expect them, e.g. *.py or *.c files, etc. A BARE repository is a folder hosted by a server which only holds git OBJECTS. In it, you’ll never see a single .py or .c file, just a bunch of folders and text files that look nothing like your code. By the magic of git, these are easily translated back into .py or .c files (basically a version of the working repo) when you git clone it. Since the bare repo doesn’t contain any of the actual code as plain files, you can safely assume that no one can mess with the master version without going through the process of git add/commit/push, which keeps everything documented. To start a bare repo…

# Start up a bare repository in a server
user@server:$~  git init --bare name_to_repo.git

# Go back to your machine then clone it
user@machine:$~ git clone user@server:/path/to/repo/name_to_repo.git

# This will clone an empty git repo into a folder called name_to_repo
cd name_to_repo
ls
# Nothing should come up.

touch README
echo "Hello world" >> README
git add README
git commit -m "Adding a README to initialise the bare repo."
git push origin master # This pushes to your origin, which is user@server:/path/to/repo/name_to_repo.git

If we check our folders, we will see the following:

user@machine:$~ ls name_to_repo/
README # only the README exists in our working copy

user@server:$~ ls /path/to/repo/name_to_repo.git/
branches/ config description HEAD hooks/ info/ objects/ refs/

Magic! README doesn’t appear as a plain file on the server. Again, this is because the repo there is BARE, so the file we pushed is only stored as git objects. But when we clone it on a different machine…

user@machine2:$~ git clone user@server:/path/to/repo/name_to_repo.git
ls name_to_repo/
README
cat README
Hello world #magic!

This was a bit of a lightning tour, but hopefully you can see that the purpose of a bare repo is to let you host code as a “master version” without worrying that people will fiddle with its contents directly; they only get a readable, editable copy once they do a git clone. Once they clone and push changes, everything is documented via git, so you’ll know exactly what’s going on!

R or Python for data vis?

Python users: ever wanted to learn R?
R users: ever wanted to learn Python?
Check out: http://mathesaurus.sourceforge.net/r-numpy.html

Both languages are incredibly powerful for large-scale data analyses, and both have amazing data visualisation platforms that let you make custom graphs very easily (e.g. with your own set of fonts, colour palette choices, etc.). Here’s a quick run-down of the good, the bad, and the ugly:

R

  • The good:
    • More established in statistical analyses; if you can’t find an R package for something, chances are it won’t be available in Python either.
    • Data frame parsing is fast and efficient, and incredibly easy to use (e.g. indexing specific rows, which is surprisingly hard in Pandas)
    • If GUIs are your thing, there are programs like Rstudio that mesh the console, plotting, and code.
  • The bad:
    • For loops are traditionally slow, meaning that you have to use lots of apply commands (e.g. tapply, sapply).
  • The ugly:
    • Help documentation can be challenging to read and follow, leading to (potentially) a steep learning curve.

Python

  • The good:
    • If you have existing code in Python (e.g. analysing protein sequences/structures), then you can plot straight away without having to save it as a separate CSV file for analysis, etc.
    • Lots of support for different packages such as NumPy, SciPy, Scikit Learn, etc., with good documentation and lots of help on forums (e.g. Stack Overflow)
    • It’s more useful for string manipulation (e.g. parsing out the ordering of IMGT numbering for antibodies, which goes from 111A->111B->112B->112A->112)
  • The bad:
    • Matplotlib, which is the go-to for data visualisation, has a pretty steep learning curve.
  • The ugly:
    • For statistical analyses, model building can have an unusual syntax. For example, building a linear model in R is incredibly easy (lm), whereas in Python it involves sklearn.linear_model.LinearRegression().fit (see the sketch after this list). Otherwise you have to code up a lot of things yourself, which might not be practical.
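
For what it’s worth, here’s a minimal sketch of the scikit-learn syntax mentioned above (assuming scikit-learn is installed); note that X has to be 2-D, even with a single predictor:

import numpy as np
from sklearn.linear_model import LinearRegression

# Fake data: y = 2x + noise
X = np.random.normal(size = (100, 1))
y = 2.0 * X[:, 0] + np.random.normal(scale = 0.5, size = 100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)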

For me, Python wins because I find it much easier to create an analysis pipeline where you go from raw data (e.g. PDB structures) to analysing it (e.g. with Biopython) and then plotting custom graphics. Another big selling point is that Python packages tend to have great documentation. Of course, there are libraries to do the analyses in R, but the level of freedom, I find, is a bit more restricted, and R’s documentation often leaves you interpreting what the package vignette is saying rather than doing actual coding.
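
As a rough illustration of that kind of pipeline – a sketch only, with a hypothetical PDB file name and Biopython assumed to be installed – you could go from a structure file to a plot without ever writing out a CSV:

from Bio.PDB import PDBParser
import matplotlib.pyplot as plt

# Parse a structure (example.pdb is a stand-in for whatever file you have)
parser = PDBParser(QUIET = True)
structure = parser.get_structure("example", "example.pdb")

# Mean B-factor per residue for the first chain of the first model
chain = next(structure[0].get_chains())
bfactors = []
for residue in chain:
    atoms = list(residue.get_atoms())
    bfactors.append(sum(a.get_bfactor() for a in atoms) / len(atoms))

fig, ax = plt.subplots()
ax.plot(bfactors)
ax.set_xlabel("Residue index")
ax.set_ylabel("Mean B-factor")
fig.savefig("bfactors.png", dpi = 300)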

As for plotting (because pretty graphs are where it’s at!), here’s a very simple implementation of plotting the densities of two normal distributions, along with their means and standard deviations.

import numpy as np
from matplotlib import rcParams

# plt.style.use('xkcd') # A cool feature of matplotlib is stylesheets, e.g. make your plots look XKCD-like

# change font to Arial
# you can change this to any TrueType font that you have in your machine
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['Arial']

import matplotlib.pyplot as plt
# Generate two sets of numbers from a normal distribution
# one with mean = 4 sd = 0.5, another with mean (loc) = 1 and sd (scale) = 2
randomSet = np.random.normal(loc = 4, scale = 0.5, size = 1000)
anotherRandom = np.random.normal(loc = 1, scale = 2, size = 1000)

# Define a Figure and Axes object using plt.subplots
# Axes object is where we do the actual plotting (i.e. draw the histogram)
# Figure object is used to configure the actual figure (e.g. the dimensions of the figure)
fig, ax = plt.subplots()

# Plot a histogram with custom-defined bins, with a blue colour, transparency of 0.4
# Plot the density rather than the raw count using density = True
# (this argument was called normed in older matplotlib versions)
ax.hist(randomSet, bins = np.arange(-3, 6, 0.5), color = '#134a8e', alpha = 0.4, density = True)
ax.hist(anotherRandom, bins = np.arange(-3, 6, 0.5), color = '#e8291c', alpha = 0.4, density = True)

# Plot solid lines for the means
plt.axvline(np.mean(randomSet), color = 'blue')
plt.axvline(np.mean(anotherRandom), color = 'red')

# Plot dotted lines for the std devs
plt.axvline(np.mean(randomSet) - np.std(randomSet), linestyle = '--', color = 'blue')
plt.axvline(np.mean(randomSet) + np.std(randomSet), linestyle = '--', color = 'blue')

plt.axvline(np.mean(anotherRandom) - np.std(anotherRandom), linestyle = '--', color = 'red')
plt.axvline(np.mean(anotherRandom) + np.std(anotherRandom), linestyle = '--', color = 'red')

# Set the title, x- and y-axis labels
plt.title('A fancy plot')
ax.set_xlabel("Value of $x$") 
ax.set_ylabel("Density")

# Set the Figure's size as a 5in x 5in figure
fig.set_size_inches((5,5))

Figure made by matplotlib using the code above.

randomSet = rnorm(mean = 4, sd = 0.5, n = 1000)
anotherRandom = rnorm(mean = 1, sd = 2, n = 1000)

# Let's define a range to plot the histogram for binning;
limits = range(randomSet, anotherRandom)
lbound = limits[1] - (diff(limits) * 0.1)
ubound = limits[2] + (diff(limits) * 0.1)
# use freq = F to plot density
# in breaks, we define the bins of the histogram by providing a vector of values using seq
# xlab, ylab define axis labels; main sets the title
# rgb defines the colour in RGB values from 0-1, with the fourth digit setting transparency
# e.g. rgb(0,1,0,1) is R = 0, G = 1, B = 0, with an alpha of 1 (i.e. not transparent)
hist(randomSet, freq = F, breaks = seq(lbound, ubound, 0.5), col = rgb(0,0,1,0.4), xlab = 'Value of x', ylab = 'Density', main = 'A fancy plot')
# Use add = T to keep both histograms in one graph
# other parameters, such as breaks, etc., can be introduced here
hist(anotherRandom, freq = F, breaks = seq(lbound, ubound, 0.5), col = rgb(1,0,0,0.4), add = T)

# Plot vertical lines with v =
# lty = 2 generates a dashed line
abline(v = c(mean(randomSet), mean(anotherRandom)), col = c('blue', 'red'))

abline(v = c(mean(randomSet)-sd(randomSet), mean(randomSet)+sd(randomSet)), col = 'blue', lty = 2)
abline(v = c(mean(anotherRandom)-sd(anotherRandom), mean(anotherRandom)+sd(anotherRandom)), col = 'red', lty = 2)

Similar figure made using R code from above.

*Special thanks go out to Ali and Lyuba for helpful fixes to make the R code more efficient!

Colour wisely…

Colour – the attribute of an image that makes it acceptable or destined for the bin. Colour has a funny effect on us – it’s a double-edged sword that can strengthen or weaken a data representation to a huge degree. No one really talks about what makes a good way to colour an image or a graph, but it’s something most of us can agree is either pleasing or disgusting. There are two distinct advantages to colouring a graph: it conveys both quantitative and categorical information very, very well. Thus, I will provide a brief overview (with code) of how colour can be used to display both quantitative and qualitative information. (*On the note of colours, Nick has previously discussed how colourblindness must be considered in visualising data…)

1. Colour conveys quantitative information.
A huge advantage of colour is that it can provide quantitative information, but this has to be done correctly. Here are three graphs showing the exact same information (the joint density of two normal distributions) and  we can see from the get-go which method is the best at representing the density of the two normal distributions:

Colouring the same graph using three different colour maps.

If you thought the middle one was the best one, I’d agree too. Why would I say that, despite it being grayscale and seemingly being the least colourful of them all?

  • Colour is not limited to hues (i.e. whether it’s red/white/blue/green, etc.); ‘colour’ is also achieved through saturation and brightness (i.e. how vivid a colour is, or how dark/light it is). In the case of the middle graph, we’re using brightness to indicate the variations in density, which is a more intuitive display. Another advantage of using shades to portray quantity is that it will most likely still work for colourblind readers.
  • Why does the graph on the right not work for this example? This is a case where we use a “sequential” colour map to convey the differences in density. Although the colour legend clarifies which colour belongs to which density bin, without it, it’s very difficult to tell what “red” means with respect to “yellow”. With the colour map we know that red means high density and yellow lower, but without the legend, we can interpret the colours very differently, e.g. as categories rather than quantities. Basically, if you decide on a sequential colour map, it must be handled well, and a colour map/legend is critical. Otherwise, we risk the colours being read as categories rather than as continuous values.
  • Why is the left graph not working well? This is an example of a “diverging” colour map.
    It’s somewhat clear that blue and red define two distinct quantities. Despite this, a major flaw of this colour map is the white in the middle. If the white were used as a “zero crossing” – basically, where white means the value is 0 – the diverging colour map would have been a more effective tool. However, matplotlib has placed white at the midpoint of the data range (by default); this sadly creates the false illusion of a 0 value, as our eyes tend to associate white with missing data, or ‘blanks’. Even if this isn’t your biggest beef with the divergent colour map, we run into the same problem as with the sequential colour map: blue and red don’t convey information (unless specified), and the darkness/lightness of the blue and red are not linked very well without the white in the middle. Thus, it doesn’t do either job very well in this graph. Basically, avoid divergent colour maps unless your data genuinely spans two sides of a meaningful midpoint (e.g. from -1 to +1); see the sketch after this list for how to pin the white to a true zero.
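
Here’s that sketch: a toy example (assuming a reasonably recent matplotlib, 3.2 or later, where TwoSlopeNorm lives) of pinning the white of a diverging colour map to a true zero:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm

# Fake "difference" data spanning negative and positive values
np.random.seed(0)
diff = np.random.normal(size = (50, 50))

# Pin white to 0 so that the colour actually encodes sign and magnitude
norm = TwoSlopeNorm(vmin = diff.min(), vcenter = 0.0, vmax = diff.max())

fig, ax = plt.subplots()
im = ax.imshow(diff, cmap = 'bwr', norm = norm)
fig.colorbar(im, ax = ax)
fig.savefig('diverging_zero_crossing.png', dpi = 300)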

2. Colour displays categorical information.
An obvious use of colour is the ability to categorise our data. Anything as simple as a line chart with multiple lines will tell you that colour is terrific at distinguishing groups. This time, notice that the different colour schemes have very different effects:

Colour schemes can instantly differentiate groups.

Notice how, this time around, the greyscale method (right) is clearly the losing choice. To begin with, it’s hard to pick out the difference between persons A, B, and C, and there’s almost a temptation to think that person A morphs into person C! On the left, however, with a distinct set of colours, persons A, B, and C are clearly distinguished as three separate colours. Although a set of three distinct colours is a good thing, bear in mind the following…

  • Make sure the colours don’t clash with respect to lightness! Try to pick something that’s distinct (blue/red/green), rather than two colours which can be interpreted as two shades of the same colour (red/pink, blue/cyan, etc.)
  • Pick a palette to choose from – a rainbow is typically the best choice just because it’s the most natural, but feel free to choose your own set of hues. Also include white and black as necessary, so long as it’s clear that they are also part of the palette. White in particular would only work if you have a black outline.
  • Keep in mind that colour blind readers can have trouble with certain colour combinations (red/yellow/green) and it’s best to steer toward colourblind-friendly palettes.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sp
from mpl_toolkits.axes_grid1 import make_axes_locatable

### Part 1
# Sample 250 points
np.random.seed(30)
x = np.random.normal(size = 250)
np.random.seed(71)
y = np.random.normal(size = 250)

# Assume the limits of a standard normal are at -3, 3
pmin, pmax = -3, 3

# Create a meshgrid that is 250x250
xgrid, ygrid = np.mgrid[pmin:pmax:250j, pmin:pmax:250j]
pts = np.vstack([xgrid.ravel(), ygrid.ravel()]) # ravel unwinds xgrid from a 250x250 matrix into a 62500x1 array

data = np.vstack([x,y])
kernel = sp.gaussian_kde(data)
density = np.reshape(kernel(pts).T, xgrid.shape) # Evaluate the density for each point in pts, then reshape back to a 250x250 matrix

greys = plt.cm.Greys
bwr = plt.cm.bwr
jet = plt.cm.jet

# Create 3 contour plots
fig, ax = plt.subplots(1,3)
g0 = ax[0].contourf(xgrid, ygrid, density, cmap = bwr)
c0 = ax[0].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black
g1 = ax[1].contourf(xgrid, ygrid, density, cmap = greys)
c1 = ax[1].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black
g2 = ax[2].contourf(xgrid, ygrid, density, cmap = jet)
c2 = ax[2].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black

# Divide each axis then place a colourbar next to it
div0 = make_axes_locatable(ax[0])
cax0 = div0.append_axes('right', size = '10%', pad = 0.1) # Append a new axes object
cb0  = plt.colorbar(g0, cax = cax0)

div1 = make_axes_locatable(ax[1])
cax1 = div1.append_axes('right', size = '10%', pad = 0.1)
cb1  = plt.colorbar(g1, cax = cax1)

div2 = make_axes_locatable(ax[2])
cax2 = div2.append_axes('right', size = '10%', pad = 0.1)
cb2  = plt.colorbar(g2, cax = cax2)

fig.set_size_inches((15,5))
plt.tight_layout()
plt.savefig('normals.png', dpi = 300)
plt.close('all')

### Part 2
years = np.arange(1999, 2017, 1)
np.random.seed(20)
progress1 = np.random.randint(low=500, high =600, size = len(years))
np.random.seed(30)
progress2 = np.random.randint(low=500, high =600, size = len(years))
np.random.seed(40)
progress3 = np.random.randint(low=500, high =600, size = len(years))

fig, ax = plt.subplots(1,2)
ax[0].plot(years, progress1, label = 'Person A', c = '#348ABD')
ax[0].plot(years, progress2, label = 'Person B', c = '#00de00')
ax[0].plot(years, progress3, label = 'Person C', c = '#A60628')
ax[0].set_xlabel("Years")
ax[0].set_ylabel("Progress")
ax[0].legend()

ax[1].plot(years, progress1, label = 'Person A', c = 'black')
ax[1].plot(years, progress2, label = 'Person B', c = 'gray')
ax[1].plot(years, progress3, label = 'Person C', c = '#3c3c3c')
ax[1].set_xlabel("Years")
ax[1].set_ylabel("Progress")
ax[1].legend()

fig.set_size_inches((10,5))
plt.tight_layout()
plt.savefig('colourgrps.png', dpi = 300)
plt.close('all')

Visualising Biological Data, Pt. 2

Here’s a quick little round-up of some of the tools/algorithms that I saw at VIZBI, which I believe can be useful for many. For more details, I strongly advise you to check out the posters page (vizbi.org/Posters/2016). There were a few that I would’ve liked to revisit, but the webapps weren’t available (e.g. MeshCloud from the Human Genome Center, Tokyo), so maybe I’ll come back with a part 3. Here are my top five:

1. Autodesk’s Protein Viewer* (shout-outs to @_merrywang on Twitter)
As a structural bioinformatician, I’m going to be really biased here and say that Autodesk’s Molecule Viewer was the best tool showcased at the conference. Not only can it visualise millions of molecules from the PDB (or your own PDB files), it also lets you annotate and share, effectively, “snapshots” of your workspace for collaboration (see this if you want to know what I mean). AND it’s free! It’s not the fastest viewer on the planet, nor the easiest thing to use, but it is effective.

2. Vectorbase
Not related to protein structures, but a really interesting visualisation that shows information on, for example, insecticide resistance. With mosquitoes being such a huge part of today’s news, this kind of information is vital for fighting and understanding the distribution of insects across the globe.

3. Phandango
This is a genome browser which, despite being a one-man effort, could be a game-changer. The UI needs a little bit of work, I think, but otherwise it’s a really valuable tool for crunching lots of genomic data quickly.

4. i-PV Circos
This is a neat circular browser that helps users view protein sequences in a circularised format. With this visualisation format becoming more popular as the days go by, I think this has the potential to be a leader in the field. At the moment the website’s a bit dark and not the most user-friendly, but some of the core functionality (e.g. highlighting residues and association of domains) is a real plus!

5. Storyline visualisation
Possibly my favourite and most eye-opening tool from the entire conference. Storyline visualisation helps users understand how things progress over time – this has been used for movie plot data (e.g. Star Wars character and plot progression), but the general concept can be useful for biological phenomena – for example, how do cells in diseased states progress over time? How do they compare to healthy states? Can we also monitor protein dynamics using a similar concept? The fact that it gives a very intuitive, big-picture overview of micro-scale dynamics is the reason I’ve been so interested in Kwan-Liu Ma’s work, and I recommend checking out his website/publications list for insight on improving data visualisation (in particular, network visualisation when you want to avoid hairballs!)

The list isn’t ranked in any way, and do check these out! There were other tools I would’ve really liked to review (e.g. Minardo, made by David Ma @frostickle on Twitter), but I suppose I can go on and on. At the end of the day, visualisation tools like these are meant to be quick, and help us to not only EXPLORE our data, but to EXPLAIN it too. I think we’re incredibly fortunate to have some amazing minds out there who are willing to not only create these tools, but also make them available for all.

Visualising Biological Data, Pt. 1

Hey Blopig Readers,

I had the privilege of going down to Heidelberg last week to see some stunning posters and artwork. I really recommend that you check some of the posters out. In particular, the “Green Fluorescent Protein” poster stuck out as my favourite. Also, if you’re a real Twitter geek, check out #Vizbi for some more tweets from throughout the week.

So what did the conference entail? As a very blunt summary, it was an eclectic collection of researchers from around the globe who showcased their research with very neat visual media. I was hoping for a conference that gave an overview of the principles that dictate how to visualise proteins, genes, etc., but it wasn’t like that at all! Although I was initially a bit disappointed, it turned out to be better that way – one of the key themes reiterated throughout the conference is that visualisations depend on the application!

From the week, these are the top 5 lessons I walked away with, and I hope you can integrate this into your own visualisation:

  1. There is no pre-defined, accepted way of visualising data. Basically, every visualisation is tailored and has a specific purpose, so don’t try to force your graph into something pretty that you’ve seen in another paper. We’re encouraged to get insight from others, but not necessarily to replicate a graph.
  2. KISS (Keep it simple, stupid!) Occam’s razor, KISS, whatever you want to call it – keep things simple. Making an overly complicated visualisation may backfire.
  3. Remember your colours. Colour is probably one of the most powerful tools in our arsenal for making the most of a visualisation. Don’t ignore them, and make sure that they’re clean, separate, and interpretable — even to those who are colour-blind!
  4. Visualisation is a means of exploration and explanation. Make lots, and lots of prototypes of data visuals. It will not only help you explore the underlying patterns in your data, but help you to develop the skills in explaining your data.
  5. Don’t forget the people. Basically, a visualisation is for a specific target audience, not for a machine. What you’re doing is encouraging connections, sharing knowledge, and creating an experience so that people can learn from your data.

I’ll come back in a few weeks’ time after reviewing some tools, stay tuned!

We can model everything, right…?

First, happy new year to all our Blopig fans, and we all hope 2016 will be awesome!

A couple of months ago, I covered this article by Shalom Rackovsky. The big question that jumps out of the paper is: has modelling reached its limits? Or, in other words, can bioinformatics techniques be used to model every protein? The author argues that protein structures have an inherent level of variability that cannot be fully captured by computational methods; thus, he raises some scepticism about what modelling can achieve. This isn’t entirely news; competitions such as CASP show that there’s still lots to work on in this field. The article takes a very interesting spin, though, as Rackovsky uses a theoretical basis to justify his claim.

For a pair of proteins P and Q, Rackovsky defines their relationship depending on their sequence and structural identity. If P and Q share a high level of sequence identity but have little structural resemblance, P and Q are considered to be a conformational switch. Conversely, if P and Q share a low level of sequence identity but have high structural resemblance, they are considered to be remote homologues.

Case of a conformational switch – two DNAPs with 100% seq identity but 5.3A RMSD.

Haemoglobins are ‘remote homologues’ – despite 19% sequence identity, these two proteins have 1.9A RMSD.

From here on comes the complex maths. Rackovsky’s work here (and in prior papers, for example) assumes that there are periodicities in the properties of proteins, and thus applies Fourier transforms to compare protein sequences and structures.

In the case of comparing protein sequences, instead of treating sequences as strings of letters, each protein sequence is characterised by an N x 10 matrix. N represents the number of amino acids in protein P (or Q), and each amino acid is described by 10 biophysical properties. The matrix then undergoes a Fourier transform (FT), and the resulting sine and cosine coefficients for proteins P and Q are used to calculate the Euclidean distance between them.
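
To make this a little more tangible, here’s a heavily simplified toy sketch – not Rackovsky’s actual implementation, and with random numbers standing in for the biophysical properties – of building an N x 10 property matrix per protein, Fourier-transforming each property column, and comparing the leading coefficients with a Euclidean distance:

import numpy as np

def sequence_distance(props_p, props_q, n_coeffs = 10):
    """props_p, props_q: (N, 10) arrays of per-residue biophysical properties."""
    dist = 0.0
    for k in range(props_p.shape[1]):
        fp = np.fft.rfft(props_p[:, k])[:n_coeffs]
        fq = np.fft.rfft(props_q[:, k])[:n_coeffs]
        # Compare the cosine (real) and sine (imaginary) coefficients
        dist += np.sum((fp.real - fq.real) ** 2 + (fp.imag - fq.imag) ** 2)
    return np.sqrt(dist)

# Toy "proteins" of different lengths with random property values
p = np.random.normal(size = (120, 10))
q = np.random.normal(size = (95, 10))
print(sequence_distance(p, q))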

When comparing structures, proteins are first broken into length-L fragments, and the dihedral angles, bond lengths and bond angles for each fragment are collected into a matrix. The distribution of these matrices allows us to project proteins onto a pre-parameterised principal components space, and the Euclidean distance between the newly projected proteins is then used to quantify structural similarity.
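
Again purely as a toy illustration (not the author’s pipeline), the gist of the structural comparison can be sketched with a PCA over fragment geometry, with random numbers in place of real dihedral angles, bond lengths and bond angles:

import numpy as np
from sklearn.decomposition import PCA

L = 8  # fragment length
# Each row is one fragment, described by 3*L fake geometric features
fragments_p = np.random.normal(size = (100, 3 * L))
fragments_q = np.random.normal(size = (80, 3 * L))

# Fit a shared principal components space (pre-parameterised in the paper)
pca = PCA(n_components = 5).fit(np.vstack([fragments_p, fragments_q]))

# Summarise each protein by its mean position in that space, then compare
p_proj = pca.transform(fragments_p).mean(axis = 0)
q_proj = pca.transform(fragments_q).mean(axis = 0)
print(np.linalg.norm(p_proj - q_proj))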

For both sequence and structure, the distances are normalised and centred around (0,0) by calculating the average distance between each protein and its M nearest neighbours, and then adjusting by the global average. Effectively, if a protein has an average sequence or structure distance, it will tend toward (0,0).

The author uses a dataset of 12,000 proteins from the CATH set to generate the following diagram; the Y-axis represents sequence similarity and the X-axis represents structural similarity. Since these axes are scaled to the mean, the closer a protein is to 0, the closer it is to the global average sequence or structure distance.

Rackovsky’s plot of sequence similarity (Y-axis) against structural similarity (X-axis) for the CATH dataset.

The four quadrants: along the diagonal is a typical linear relationship (greater sequence identity = more structural similarity). The lower-right quadrant represents proteins with LOW sequence similarity yet HIGH structural similarity. In the upper-left quadrant, proteins have LOW structural similarity but HIGH sequence similarity.

Rackovsky argues that, while remote homologues and conformational switches seem like rare phenomena, they account for approximately 50% of his dataset. Although he does account for the high density of proteins around (0,0), the paper does not clearly address the meaning of these new metrics. In other words, the author does not translate these values into something we’re more familiar with (e.g. RMSD for structural distance, and sequence identity % for sequence distance). Although the whole idea is that his method is supposed to be alignment-free, it’s still difficult to relate it to what we already use as the gold standard in traditional protein structure prediction problems. Also, note that the structure distance spans between -0.1 and 0.1 units whereas the sequence distance spans between -0.3 and 0.5. The differences in scale are not covered either – i.e., is a difference of 0.01 units an expected value for protein structure distance, and why are the jumps in structure distance so much smaller than the jumps in sequence space?

The author makes more interesting observations in the dataset (e.g. α/β mixed proteins are more tolerant to mutations in comparison to α- or β-only proteins) but the observations are not discussed in depth. If α/β-mixed proteins are indeed more resilient to mutations, why is this the case? Conversely, if small mutations change α- or β-only proteins’ structures to make new folds, having any speculation on the underlying mechanism (e.g. maybe α-only proteins are only sensitive to radically different amino acid substitutions, such as ALA->ARG) will only help our prediction methods. Overall I had the impression that the author was a bit too pessimistic about what modelling can achieve. Though we definitely cannot model all proteins that are out there at present, I believe the surge of new sources of data (e.g. cryo-EM structures) will provide an alternative inference route for better prediction methods in the future.