# What happens to the Human Immune Repertoire over time?

Last week during the group meeting we talked about a pre-print publication from the Ippolito group in Austin, TX (here). The authors monitored the antibody repertoire of Bone Marrow Plasma Cells (the source of ~80% of circulating antibodies) in one individual over a period of 6.5 years. For comparison, they monitored another individual over a period of 2.3 years. In a nutshell, the paper is like Picture 1 — just with antibodies.

In a nutshell, this is what the paper is about: what the antibody repertoire looks like when sampled at different time points in an individual's lifetime.


The main question that they aimed to answer was: 'Is the human antibody repertoire stable over time?' It is plausible to think that there should be some 'ground distribution' of antibodies that persists over time and acts as a default safety net. However, we know that the antibody makeup can change radically, especially when challenged by antigen. Therefore, it is interesting to ask: does the immune repertoire maintain a fairly stable distribution or not?

Firstly, it is necessary to define what we consider a stable distribution of the human antibody repertoire. Antibodies undergo V(D)J recombination as well as somatic hypermutation, meaning that the estimated >10^10 antibodies a human is capable of producing span a very wide range of possible sequences. In this publication the authors mostly addressed this question by looking at how the usage of the possible V, D and J genes, and their combinations, changes over time.

Seven snapshots of the immune repertoire were taken from the individual monitored over 6.5 years, and two from the individual monitored over 2.3 years. Looking at the usage of the V, D and J genes over time, the proportions at each of the seven time points appear quite stable (Pic 2). The authors report a similar result for the gene combinations. This would suggest that our antibody repertoire is biased to sample 'similar' antibodies over time. These frequencies were also compared to those of the individual sampled over 2.3 years, and the differences between the two individuals appear small.

How the frequencies of the V, D and J genes change (or rather, don't) over 6.5 years in a single individual

It is a very interesting study which hints that we (humans) might be sampling our antibodies from a biased distribution — meaning that our bodies might have developed a well-defined safety net which is capable of raising an antibody towards an arbitrary antigen. It is an interesting starting point, and to further check this hypothesis it would be necessary to carry out such a study on multiple individuals — as a minimum to see if there really are no differences between us, which would at the same time hint that the repertoire does not change over time.

# Processing large files using python: part deux

Last week I wrote a post on some of the methods I use in python to efficiently process very large datasets. You can find that here. Roughly, it details how one can break a large file into chunks which can then be passed onto multiple cores to do the number crunching. Below I expand upon this, first creating a parent class which turns a given (large) file into chunks. I construct it in a manner in which child classes can easily be created and tailored for specific file types, and give some examples. Finally, I give some wrapping functions for use in conjunction with any of the chunkers so that the chunks can be processed using multiple cores.

First, and as an aside, I was asked after the previous post at what scale these methods should be considered. A rough answer would be: when the size of the data becomes comparable to the available RAM. A better answer would be: when the overhead of reading each individual line (or entry) is more than the operation performed on that entry. Here is an example of this case, though it isn't really that fair a comparison:


>>> import timeit,os.path
>>> os.path.getsize("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa")
10095955
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([l.count(">") for l in f])',number=10)
0.8403599262237549
>>> timeit.timeit('f = open("Saccharomyces_cerevisiae.R64-1-1.cds.all.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")])',number=10)
0.15671014785766602



For a small 10MB fasta file, we count the number of sequences present in a fifth of the time using chunks. I should be honest though, and state that the speedup is mostly due to not having to identify newline characters in the chunking method; but nevertheless, it shows the power one can have using chunks. For a 14GB fasta file, the times for the chunking (1MB chunks) and non-chunking methods are 55s and 130s respectively.

Getting back on track, let’s turn the chunking method into a parent class from which we can build on:


import os.path

class Chunker(object):

    #Iterator that yields start and end locations of a file chunk of default size 1MB.
    @classmethod
    def chunkify(cls,fname,size=1024*1024):
        fileEnd = os.path.getsize(fname)
        with open(fname,'r') as f:
            chunkEnd = f.tell()
            while True:
                chunkStart = chunkEnd
                f.seek(size,1)
                cls._EOC(f)
                chunkEnd = f.tell()
                yield chunkStart, chunkEnd - chunkStart
                if chunkEnd >= fileEnd:
                    break

    #Move file pointer to end of chunk
    @staticmethod
    def _EOC(f):
        f.readline() #read to the end of the current line

    #Read the chunk (start, size) from the file
    @staticmethod
    def read(fname,chunk):
        with open(fname,'r') as f:
            f.seek(chunk[0])
            return f.read(chunk[1])

    #Iterator that splits a chunk into units (here, single lines)
    @staticmethod
    def parse(chunk):
        for line in chunk.splitlines():
            yield line



In the above, we create the class Chunker which has the class method chunkify as well as the static methods, _EOC, read, and parse. The method chunkify does the actual chunking of a given file, returning an iterator that yields tuples containing a chunk’s start and size. It’s a class method so that it can make use of _EOC (end of chunk) static method, to move the pointer to a suitable location to split the chunks. For the simplest case, this is just the end/start of a newline. The read and parse methods read a given chunk from a file and split it into units (single lines in the simplest case) respectively. We make the non-chunkify methods static so that they can be called without the overhead of creating an instance of the class.
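To make the boundary-snapping behaviour concrete, here is a minimal standalone sketch of the chunkify logic (not the class itself) run on a throwaway file with tiny 4-byte chunks. The file names and contents are illustrative, and the file is opened in binary mode since Python 3 only allows relative seeks on binary handles:

```python
import os, tempfile

def chunkify(fname, size):
    # Same logic as Chunker.chunkify above; binary mode keeps the
    # relative seek() legal under Python 3.
    fileEnd = os.path.getsize(fname)
    with open(fname, 'rb') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size, 1)
            f.readline()  # _EOC: snap the boundary to the next newline
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd >= fileEnd:
                break

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"aaaa\nbbbb\ncccc\ndddd\n")
tmp.close()

data = open(tmp.name, 'rb').read()
chunks = [data[start:start + size] for start, size in chunkify(tmp.name, 4)]
os.unlink(tmp.name)
# Every chunk ends on a newline, so no line is ever split across chunks.
```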

Let’s now create some children of this class for specific types of files. First, one of the most well-known file types in bioinformatics, FASTA. Below is a segment of a FASTA file. Each entry has a header line, which begins with a ‘>’, followed by a unique identifier for the sequence. After the header line, one or more lines follow giving the sequence. Sequences may be either protein or nucleic acid sequences, and they may contain gaps and/or alignment characters.


>SEQUENCE_1
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI



And here is the file type specific chunker:


from Bio import SeqIO
from cStringIO import StringIO

class Chunker_FASTA(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #find the end of the (likely partial) current line
        p = f.tell()
        l = f.readline()
        while l and l[0] != '>': #find the start of the next sequence
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        fh = StringIO(chunk)
        for record in SeqIO.parse(fh,"fasta"):
            yield record
        fh.close()



We update the _EOC method to find where one entry finishes and the next begins by locating “>”, following which we rewind the file handle pointer to the start of that line. We also update the parse method to use the FASTA parser from BioPython, yielding SeqRecord objects for each entry in the chunk.
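The rewind logic can be seen in isolation with a short standalone sketch (plain stdlib, no BioPython; the sequences are made up): we seek into the middle of a record, finish the partial line, then scan forward and revert so the boundary falls just before the next ‘>’ header.

```python
import os, tempfile

fasta = b">SEQ1\nACGTACGT\nACGT\n>SEQ2\nTTTT\n"
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(fasta)
tmp.close()

with open(tmp.name, 'rb') as f:   # binary mode: byte-accurate tell()/seek()
    f.seek(7)                     # land mid-way through SEQ1's sequence
    l = f.readline()              # finish the partial line
    p = f.tell()
    l = f.readline()
    while l and not l.startswith(b'>'):  # scan until the next header line
        p = f.tell()
        l = f.readline()
    f.seek(p)                     # revert one line
    boundary = f.tell()

os.unlink(tmp.name)
# The chunk boundary now sits exactly at the start of the '>SEQ2' record.
```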

For a second, slightly harder example, here is one designed to work with output produced by bowtie, an aligner of short reads from NGS data. The format consists of tab-separated columns, with the id of each read located in the first column. Note that a single read can align to multiple locations (up to 8 by default!), hence the same id appears on multiple lines. A small example section of the output is given below.


SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-2	2502 GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.1 HWI-EAS355_3_Nick_1_1_464_1416 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-1	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN37-1	4460	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.2 HWI-EAS355_3_Nick_1_1_341_1118 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCG	IIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN25-2	2502	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3
SRR014374.8 HWI-EAS355_3_Nick_1_1_187_1549 length=36	+	RDN37-2	4460	GTTTCTTTACTTATTCAATGAAGCGG	IIIIIIIIIIIIIIIIIIIIIIIIII	3



with the corresponding chunker given by:


class Chunker_BWT(Chunker):

    @staticmethod
    def _EOC(f):
        l = f.readline() #find the end of the (likely partial) current line
        if not l: return #EOF
        p = f.tell()
        l = f.readline()
        if not l: return #EOF
        readID = l.split()[0]
        while l and l.split()[0] == readID: #scan past lines with the same read id
            p = f.tell()
            l = f.readline()
        f.seek(p) #revert one line

    @staticmethod
    def parse(chunk):
        lines = chunk.splitlines()
        N = len(lines)
        i = 0
        while i < N:
            readID = lines[i].split()[0]
            j = i + 1
            while j < N and lines[j].split()[0] == readID:
                j += 1
            yield lines[i:j] #all alignments of a single read as one entry
            i = j



This time, the end of chunk is located by reading lines until there is a switch in the read id, whereupon we revert one line. For parsing, we yield all the different locations a given read aligns to as a single entry.
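The grouping in parse can also be sketched compactly with itertools.groupby, which bundles consecutive lines sharing a read id (the first column) into one entry. The chunk below is toy data, not real bowtie output:

```python
from itertools import groupby

# Toy stand-in for a bowtie chunk: two alignments for read1, one for read2.
chunk = (
    "read1\t+\tchrI\t100\n"
    "read1\t+\tchrII\t200\n"
    "read2\t+\tchrI\t300\n"
)

def parse(chunk):
    # Yield all alignments of a single read (consecutive lines with the
    # same first column) as one entry.
    for readID, group in groupby(chunk.splitlines(), key=lambda l: l.split()[0]):
        yield list(group)

entries = list(parse(chunk))
```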

Hopefully these examples show how the parent class can easily be extended for most file types. Let’s now combine these various chunkers with the code from the previous post to show how we can enable multicore parallel processing of the chunks they yield. The code below contains a few generalised wrapper functions which work in tandem with any of the above chunkers to allow most tasks to be parallelised.


import multiprocessing as mp, sys

def _printMP(text):
    print text
    sys.stdout.flush()

def _workerMP(chunk,inFile,jobID,worker,kwargs):
    _printMP("Processing chunk: "+str(jobID))
    output = worker(chunk,inFile,**kwargs)
    _printMP("Finished chunk: "+str(jobID))
    return output

def main(inFile,worker,chunker=Chunker,cores=1,kwargs={}):
    pool = mp.Pool(cores)

    jobs = []
    for jobID,chunk in enumerate(chunker.chunkify(inFile)):
        job = pool.apply_async(_workerMP,(chunk,inFile,jobID,worker,kwargs))
        jobs.append(job)

    output = []
    for job in jobs:
        output.append( job.get() )

    pool.close()

    return output



The main function should be recognisable as the code from the previous post. It generates the pool of workers, ideally one for each core, before using the given chunker to split the corresponding file into a series of chunks for processing. Unlike before, we collect the output given by each job and return it after processing is finished. The main function acts as a wrapper allowing us to specify different processing functions and different chunkers, via the variables worker and chunker respectively. We have wrapped the processing function call within the function _workerMP, which prints to the terminal as tasks are completed. It uses the function _printMP to do this, as you need to flush the terminal after a print statement when using multicore processing, otherwise nothing appears until all tasks are completed.

Let’s finish by showing an example of how we would use the above to do the same task as we did at the start of this post, counting the sequences within a fasta file, using the base chunker:


def seq_counter(chunk,inFile):
    data = Chunker.read(inFile,chunk)
    return data.count('>')



and using the FASTA chunker:


def seq_counter_2(chunk,inFile):
    data = Chunker_FASTA.read(inFile,chunk)
    return len(list(Chunker_FASTA.parse(data)))



And the time they take to count the sequences within the 14GB file from before:


>>> os.path.getsize("SRR951828.fa")
14944287128
>>> x=time.time();f = open("SRR951828.fa");sum([l.count(">") for l in f]);time.time()-x
136829250
190.05533599853516
>>> x=time.time();f = open("SRR951828.fa");sum([c.count(">") for c in iter(lambda: f.read(1024*1024),"")]);time.time()-x
136829250
26.343637943267822
>>> x=time.time();sum(main("SRR951828.fa",seq_counter,cores=8));time.time()-x
136829250
4.36846399307251
>>> x=time.time();sum(main("SRR951828.fa",seq_counter_2,Chunker_FASTA,8));time.time()-x
136829250
398.94060492515564



Let’s ignore that last one, as the slowdown is due to turning the entries into BioPython SeqRecords. The prior one, which combines chunking and multicore processing, gives a speed-up of roughly a factor of 40 (190s down to 4.4s). I’m sure this could be reduced further using more cores and/or optimising the chunk size; however, this difference alone can change something from being computationally infeasible to feasible. Not bad for only a few lines of code.

Anyway, as before, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. If I feel like it, I may follow this up with a post about how to integrate a queue into the above which outputs the result of each job as it is produced. In the above, we currently collate all the results in memory, which has the potential to cause a memory overflow depending on what is being returned.

# Processing large files using python

In the last year or so, and with my increased focus on ribo-seq data, I have come to fully appreciate what the term big data means. Ribo-seq studies in their raw forms can easily reach into hundreds of GBs, which means that processing them in both a timely and efficient manner requires some thought. In this blog post, and hopefully those following, I want to detail some of the methods I have come up with (read: pieced together from multiple Stack Exchange posts) that help me take on data of this magnitude. Specifically, I will be detailing methods for python and R, though some of the methods are transferable to other languages.

My first big data tip for python is learning how to break your files into smaller units (or chunks) in a manner that you can make use of multiple processors. Let’s start with the simplest way to read a file in python.


with open("input.txt") as f:
    data = f.readlines()
    for line in data:
        process(line)



The mistake made above, with regards to big data, is that it reads all the data into RAM before attempting to process it line by line. This is likely the simplest way to make the memory overflow and an error be raised. Let’s fix this by reading the data in line by line, so that only a single line is stored in RAM at any given time.


with open("input.txt") as f:
    for line in f:
        process(line)



This is a big improvement, namely it doesn’t crash when fed a big file (though also it’s shorter!). Next we should attempt to speed this up a bit by making use of all these otherwise idle cores.


import multiprocessing as mp

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for line in f:
        jobs.append( pool.apply_async(process,(line,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()



Provided the order in which you process the lines doesn’t matter, the above generates a set (pool) of workers, ideally one for each core, before creating a bunch of tasks (jobs), one for each line, for the workers to do. I tend to use the Pool object provided by the multiprocessing module due to its ease of use; however, you can spawn and control individual workers using mp.Process if you want finer control. For mere number crunching, the Pool object is very good.

While the above is now making use of all those cores, it sadly runs into memory problems once again. We specifically use the apply_async function so that the pool isn’t blocked while each line processes. However, in doing so, all the data is read into memory once again; this time stored as individual lines associated with each job, waiting in line to be processed. As such, the memory will again overflow. Ideally the method would only read the line into memory when it is its turn to be processed.


import multiprocessing as mp

def process_wrapper(lineID):
    with open("input.txt") as f:
        for i,line in enumerate(f):
            if i != lineID:
                continue
            else:
                process(line)
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    for ID,line in enumerate(f):
        jobs.append( pool.apply_async(process_wrapper,(ID,)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()



Above we’ve now changed the function fed to pool of workers to include opening the file, locating the specified line, reading it into memory, and then processing it. The only input now stored for each job spawned is the line number, thereby preventing the memory overflow. Sadly, the overhead involved in having to locate the line by reading iteratively through the file for each job is untenable, getting progressively more time consuming as you get further into the file. To avoid this we can use the seek function of file objects which skips you to a particular location within a file. Combining with the tell function, which returns the current location within a file, gives:


import multiprocessing as mp

def process_wrapper(lineByte):
    with open("input.txt") as f:
        f.seek(lineByte)
        line = f.readline()
        process(line)

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
with open("input.txt") as f:
    nextLineByte = f.tell()
    line = f.readline() #readline (not iteration) keeps tell() accurate
    while line:
        jobs.append( pool.apply_async(process_wrapper,(nextLineByte,)) )
        nextLineByte = f.tell()
        line = f.readline()

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()



Using seek we can move directly to the correct part of the file, whereupon we read a line into memory and process it. We have to be careful to correctly handle the first and last lines, but otherwise this does exactly what we set out to do, namely using all the cores to process a given file while not overflowing the memory.
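Here is a quick standalone sketch of tell() and seek() on a throwaway file (binary mode so the offsets are plain byte positions): tell() records where each line starts, and seek() jumps straight back to any of them.

```python
import os, tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"first\nsecond\nthird\n")
tmp.close()

offsets = []
with open(tmp.name, 'rb') as f:
    while True:
        offsets.append(f.tell())   # position BEFORE reading each line
        if not f.readline():       # the final offset recorded is EOF
            break

with open(tmp.name, 'rb') as f:
    f.seek(offsets[1])             # jump straight to the second line
    line = f.readline()

os.unlink(tmp.name)
```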

I’ll finish this post with a slight upgrade to the above, as there is a reasonable amount of overhead associated with opening and closing the file for each individual line. If we process multiple lines of the file at a time as a chunk, we can reduce these operations. The biggest technicality when doing this is noting that when you jump to a location in a file, you are likely not located at the start of a line. For a simple file, as in this example, this just means you need to call readline, which reads to the next newline character. More complex file types likely require additional code to locate a suitable location to start/end a chunk.
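The boundary problem can be seen directly in a tiny standalone sketch (made-up file contents; binary mode so the seek behaves the same under Python 3): seek() drops you mid-line, and a single readline() advances to the next newline so the chunk starts on a clean line.

```python
import os, tempfile

tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"alpha\nbeta\ngamma\n")
tmp.close()

with open(tmp.name, 'rb') as f:
    f.seek(8)                      # lands in the middle of "beta"
    f.readline()                   # discard the partial line
    aligned = f.tell()             # now at the start of the next line
    nxt = f.readline()

os.unlink(tmp.name)
```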


import multiprocessing as mp,os

def process_wrapper(chunkStart, chunkSize):
    with open("input.txt") as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            process(line)

def chunkify(fname,size=1024*1024):
    fileEnd = os.path.getsize(fname)
    with open(fname,'r') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size,1)
            f.readline() #read to the end of the line so chunks split on newlines
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd >= fileEnd:
                break

#init objects
pool = mp.Pool(cores)
jobs = []

#create jobs
for chunkStart,chunkSize in chunkify("input.txt"):
    jobs.append( pool.apply_async(process_wrapper,(chunkStart,chunkSize)) )

#wait for all jobs to finish
for job in jobs:
    job.get()

#clean up
pool.close()



Anyway, I hope that some of the above was either new or even perhaps helpful to you. If you know of a better way to do things (in python), then I’d be very interested to hear about it. In another post coming in the near future, I will expand on this code, turning it into a parent class from which to create multiple children to use with various file types.

# Colour wisely…

Colour – the attribute of an image that makes it acceptable or destined for the bin. Colour has a funny effect on us – it’s a double-edged sword that can greatly strengthen or weaken the representation of data. No one really talks about what makes a good way to colour an image or a graph, but it’s something most of us can agree on as being pleasing or off-putting. There are two distinct advantages to colouring a graph: it conveys both quantitative and categorical information very, very well. Thus, I will provide a brief overview (with code) on how colour can be used to display both quantitative and qualitative information. (*On the note of colours, Nick has previously discussed how colourblindness must be considered in visualising data…).

1. Colour conveys quantitative information.
A huge advantage of colour is that it can provide quantitative information, but this has to be done correctly. Here are three graphs showing the exact same information (the joint density of two normal distributions), and we can see from the get-go which method is best at representing that density:

Colouring the same graph using three different colour maps.

If you thought the middle one was the best one, I’d agree too. Why would I say that, despite it being grayscale and seemingly being the least colourful of them all?

• Colour is not limited to hues (i.e. whether it’s red/white/blue/green, etc.); ‘colour’ is also achieved by saturation and brightness (i.e. how vivid a colour is, or how dark/light it is). In the case of the middle graph, we’re using brightness to indicate the variations in density, and this is a more intuitive display of those variations. Another advantage of using shades to portray colour is that the graph will most likely remain legible to colourblind users.
• Why does the graph on the right not work for this example? This is a case where we use a “sequential” colour map to convey the differences in density. Although the colour legend clarifies which colour belongs to which density bin, without it, it’s very difficult to tell what “red” means with respect to “yellow”. With the colour map we know that red means high density and yellow lower, but without the legend, we could interpret the colours very differently, e.g. as categories rather than quantities. Basically, if you decide on a sequential colour map, it must be handled well, and a colour bar/legend is critical. Otherwise, we risk reading colours as categories rather than as continuous values.
• Why is the left graph not working well? This is an example of a “diverging” colour map. It’s somewhat clear that blue and red define two distinct quantities. Despite this, a major error of this colour map is the white colour in the middle. If white were used as a “zero crossing” (i.e. where white means the value is 0), the diverging colour map would have been a more effective tool. However, matplotlib has placed white at the median value (by default); this sadly creates the false illusion of a 0 value, as our eyes tend to associate white with missing data, or ‘blanks’. Even if this isn’t your biggest beef with the diverging colour map, we also run into the same problem as with the sequential colour map: blue and red don’t convey magnitude (unless specified), and the darkness/lightness of the blue and red are not linked very well across the white in the middle. Thus, it doesn’t do either job very well in this graph. Basically, avoid diverging colour maps unless your data has two signed ranges around a meaningful midpoint (e.g. data spanning from -1 to +1).
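If you do need a genuine zero-crossing diverging map, matplotlib can pin white to zero explicitly. A minimal sketch using matplotlib.colors.TwoSlopeNorm (available in recent matplotlib versions; the data range here is made up):

```python
from matplotlib.colors import TwoSlopeNorm

# Pin the centre of a diverging map (e.g. bwr) to 0, so white genuinely
# means zero even when the data range is asymmetric (here -1 to +3).
norm = TwoSlopeNorm(vmin=-1.0, vcenter=0.0, vmax=3.0)

# The norm maps data into [0, 1]; 0 lands exactly on the colormap
# midpoint (white), while vmin and vmax land on the two extremes.
mid = float(norm(0.0))
low, high = float(norm(-1.0)), float(norm(3.0))
```

Passing norm=norm (together with cmap=bwr) to contourf would then colour the left-hand plot with white anchored at zero rather than at the median.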

2. Colour displays categorical information.
An obvious use of colour is the ability to categorise our data. Anything as simple as a line chart with multiple lines will tell you that colour is terrific at distinguishing groups. This time, notice that the different colour schemes have very different effects:

Colour schemes can instantly differentiate groups.

Notice how this time around, the greyscale method (right) was clearly the losing choice. To begin with, it’s hard to pick out the differences between persons A, B and C, and there’s almost a temptation to think that person A morphs into person C! On the left, however, with a distinct set of colours, there is a clear distinction between persons A, B, and C. Although a set of three distinct colours is a good thing, bear in mind the following…

• Make sure the colours don’t clash with respect to lightness! Try to pick something that’s distinct (blue/red/green), rather than two colours which can be interpreted as two shades of the same colour (red/pink, blue/cyan, etc.)
• Pick a palette to choose from – a rainbow is typically the best choice just because it’s the most natural, but feel free to choose your own set of hues. Also include white and black as necessary, so long as it’s clear that they are also part of the palette. White in particular would only work if you have a black outline.
• Keep in mind that colour blind readers can have trouble with certain colour combinations (red/yellow/green) and it’s best to steer toward colourblind-friendly palettes.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as sp
from mpl_toolkits.axes_grid1 import make_axes_locatable

### Part 1
# Sample 250 points
np.random.seed(30)
x = np.random.normal(size = 250)
np.random.seed(71)
y = np.random.normal(size = 250)

# Assume the limits of a standard normal are at -3, 3
pmin, pmax = -3, 3

# Create a meshgrid that is 250x250
xgrid, ygrid = np.mgrid[pmin:pmax:250j, pmin:pmax:250j]
pts = np.vstack([xgrid.ravel(), ygrid.ravel()]) # ravel unwinds xgrid from a 250x250 matrix into a 62500x1 array

data = np.vstack([x,y])
kernel = sp.gaussian_kde(data)
density = np.reshape(kernel(pts).T, xgrid.shape) # Evaluate the density for each point in pts, then reshape back to a 250x250 matrix

greys = plt.cm.Greys
bwr = plt.cm.bwr
jet = plt.cm.jet

# Create 3 contour plots
fig, ax = plt.subplots(1,3)
g0 = ax[0].contourf(xgrid, ygrid, density, cmap = bwr)
c0 = ax[0].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black
g1 = ax[1].contourf(xgrid, ygrid, density, cmap = greys)
c1 = ax[1].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black
g2 = ax[2].contourf(xgrid, ygrid, density, cmap = jet)
c2 = ax[2].contour(xgrid, ygrid, density, colors = 'k') # Create contour lines, all black

# Divide each axis then place a colourbar next to it
div0 = make_axes_locatable(ax[0])
cax0 = div0.append_axes('right', size = '10%', pad = 0.1) # Append a new axes object
cb0  = plt.colorbar(g0, cax = cax0)

div1 = make_axes_locatable(ax[1])
cax1 = div1.append_axes('right', size = '10%', pad = 0.1)
cb1  = plt.colorbar(g1, cax = cax1)

div2 = make_axes_locatable(ax[2])
cax2 = div2.append_axes('right', size = '10%', pad = 0.1)
cb2  = plt.colorbar(g2, cax = cax2)

fig.set_size_inches((15,5))
plt.tight_layout()
plt.savefig('normals.png', dpi = 300)
plt.close('all')

### Part 2
years = np.arange(1999, 2017, 1)
np.random.seed(20)
progress1 = np.random.randint(low=500, high =600, size = len(years))
np.random.seed(30)
progress2 = np.random.randint(low=500, high =600, size = len(years))
np.random.seed(40)
progress3 = np.random.randint(low=500, high =600, size = len(years))

fig, ax = plt.subplots(1,2)
ax[0].plot(years, progress1, label = 'Person A', c = '#348ABD')
ax[0].plot(years, progress2, label = 'Person B', c = '#00de00')
ax[0].plot(years, progress3, label = 'Person C', c = '#A60628')
ax[0].set_xlabel("Years")
ax[0].set_ylabel("Progress")
ax[0].legend()

ax[1].plot(years, progress1, label = 'Person A', c = 'black')
ax[1].plot(years, progress2, label = 'Person B', c = 'gray')
ax[1].plot(years, progress3, label = 'Person C', c = '#3c3c3c')
ax[1].set_xlabel("Years")
ax[1].set_ylabel("Progress")
ax[1].legend()

fig.set_size_inches((10,5))
plt.tight_layout()
plt.savefig('colourgrps.png', dpi = 300)
plt.close('all')


# Seventh Joint Sheffield Conference on Cheminformatics Part 1 (#ShefChem16)

In early July I attended the Seventh Joint Sheffield Conference on Cheminformatics. There was a variety of talks, with speakers at all stages of their careers. I was lucky enough to be invited to speak at the conference, and gave my first conference talk! I have written two blog posts about the conference: part 1 briefly describes a talk that I found interesting and part 2 describes the work I spoke about at the conference.

One of the most interesting parts of the conference was the active Twitter presence, #ShefChem16. All of the talks were live-tweeted, which provided a summary of each talk and also included links to software or references. It also allowed speakers to gain insight and feedback on their talks instantly.

One of the talks I found most interesting presented the Protein-Ligand Interaction Profiler (PLIP), a method for the detection of protein-ligand interactions. PLIP is open-source and has both a web-based online tool and a command-line tool. Unlike PyMol, which only calculates polar contacts and not the type of interaction, PLIP calculates 8 different types of interactions: hydrogen bonds, hydrophobic contacts, π-π stacking, π-cation interactions, salt bridges, water bridges, halogen bonds and metal complexes. For a given PDB file the interactions are calculated and shown in a publication-quality figure shown here.

The display can also be downloaded as a PyMol session so the display can be modified.

This tool is an extremely useful way to calculate protein-ligand interactions and can be used to find the types of interactions formed by the protein-ligand complex.

PLIP can be found here: https://projects.biotec.tu-dresden.de/plip-web/plip/

# Tracked changes in LaTeX

Maybe people keep telling you Word is great, but you are just too emotionally attached to LaTeX to consider using anything else. It just looks so beautiful. Besides, you would have to leave your beloved Linux environment (maybe that’s just me), so you stick with what you know. You work long and hard for many weeks, finally producing a draft of a paper that gets the all clear from your supervisor to submit to journal X. Eventually you hear back, and the reviewers have responded with some good ideas and a few pedantic points. Apparently this time the journal wants a tracked-changes version to go with your revised manuscript.

Highlighting every change sounds like a lot of bother, and besides, you’d have to process the highlighted version to generate the clean version they want you to submit alongside it. There must be a better way, and one that doesn’t involve converting your document to Word.

Thankfully, the internet has an answer! Check out this little package, changes, which will do just what you need. As long as you annotate your edits using \deleted{}, \replaced{} and \added{} along the way, you will have to change just one word of your tex source file in order to produce either the highlighted or the final version. It even comes with a handy bash script to get rid of the resulting mess when you’re happy with the result, leaving you with a clean final tex source file.
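A minimal sketch of how the annotations look in practice (the document text is illustrative; see the package documentation for the full option list):

```latex
\documentclass{article}
\usepackage{changes} % add the [final] option to emit the clean version
\begin{document}
We \replaced{analysed}{analyzed} the data \added{carefully}
\deleted{, as requested by reviewer 2,} and redrew Figure 3.
\end{document}
```

With no option the markup is shown highlighted; switching the one word to \usepackage[final]{changes} produces the clean accepted text.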

The die-hard Word fans won’t be impressed, but you will be very satisfied that you have found a nice little solution that does just the job you want it to. It’s actually capable of much more, including comments by multiple authors, customisation of colours and styles, and an automatically generated summary of changes. I have heard good things about ShareLaTeX for collaboration, but this simple package will get you a long way if you are not keen to start paying money yet.

# Rational Design of Antibody Protease Inhibitors

On the theme of my research area and my last presentation, I talked at group meeting about another success story in structurally designing an antibody by replicating a general protein binding site using grafted fragments involved in the original complex. The paper by Liu et al. is important to me for two major reasons. Firstly, they used an unconventional antibody for protein design, namely a bovine antibody, which is known to have an extended CDR H3. Secondly, the fragment was not grafted at the anchor points of the CDR loop.

SFTI-1 is a cyclic peptide and a known trypsin inhibitor. Its structure is stabilised by a disulphide bridge. The bovine antibody is known to have an extended H3 loop, which is essentially a long β-strand stalk with a knob domain at the end. Liu et al. removed the knob domain and a portion of the β-strand and grafted the acyclic version of SFTI-1 onto it. As I said above, this result is very important because it shows that we can graft a fragment/loop at places other than the anchor points of the CDR. This opens up the possibility of grafting more diverse fragments, both because of the new anchor points and because the fragment sits further away from the other CDRs and the framework, allowing it more conformational space. To see how the designed antibody compares to the original peptide, they measured the Kd and found a four-fold increase (9.57 vs 13.3). They hypothesise that this is probably because the extended β-strand on the antibody keeps the acyclic SFTI-1 peptide in a more stable conformation.

The problem with the bovine antibody is that, if introduced into a human subject, it would probably elicit an immune response from the native immune system. To humanise this antibody, they found the human framework that shares the greatest sequence identity with the bovine antibody and then grafted the fragment onto it. The human antibody does not have an extended CDR H3, so to decide on the best grafting site they tried various combinations, again showing that fragments do not need to be grafted exactly at the anchor points. Some of the resulting human antibodies showed even more impressive Kds.

The best designed human antibody had a Kd of 0.79 nM, another 10-fold gain. Liu et al. hypothesised that this might be because the cognate protein forms contacts with residues on the other CDRs, even though there is no crystal structure to show this. To test this hypothesis, they mutated surface residues on the H2 and L1 loops to alanine, which resulted in a 6.7-fold decrease in affinity. The only comment I would make is that these mutations might have destabilised the other CDRs, which could itself be the reason for the decrease in affinity.

# Quantifying dispersion under varying instrument precision

Experimental errors are common when generating new data. Often these errors are simply due to the inability of the instrument to make precise measurements. In addition, different instruments can have different levels of precision, even though they are used to perform the same measurement. Take for example two balances and an object with a mass of 1 kg. The first balance, measuring this object several times, might record values of 1.0083 and 1.0091, while the second balance might give values of 1.1074 and 0.9828. In this case the first balance has higher precision, as the difference between its measurements is smaller than the difference between the measurements of the second.

In order to have some control over the error introduced by the varying precision of different instruments, each instrument $latex i$ is labelled with a measure of its precision $latex 1/\sigma_i^2$, or equivalently with its dispersion $latex \sigma_i^2$.

Let’s assume that the type of information these instruments record is of the form $latex X_i=C + \sigma_i Z$, where $latex Z \sim N(0,1)$ is an error term, $latex X_i$ is the value recorded by instrument $latex i$, and $latex C$ is the fixed true quantity of interest the instrument is trying to measure. But what if $latex C$ is not a fixed quantity? What if the underlying phenomenon being measured is itself stochastic, like the measurement $latex X_i$? If we are measuring the weight of cattle at different times, the length of a bacterial cell, or the concentration of a given drug in an organism, then in addition to the error that arises from the instruments there is also some noise introduced by dynamical changes in the object being measured. In this scenario the phenomenon of interest can be given by a random variable $latex Y \sim N(\mu,S^2)$, so the instruments record quantities of the form $latex X_i=Y + \sigma_i Z$.

Under this model, estimating $latex \mu$, the expected state of the phenomenon of interest, is not a big challenge. Assume that there are values $latex x_1,x_2,\dots,x_n$ observed from realisations of the variables $latex X_i \sim N(\mu, \sigma_i^2 + S^2)$, which came from $latex n$ different instruments. Here $latex \sum x_i /n$ is still a good estimator of $latex \mu$, as $latex E(\sum X_i /n)=\mu$. A more challenging problem is to infer the underlying variability of the phenomenon of interest $latex Y$. Under our previous setup, this reduces to estimating $latex S^2$, as we are assuming $latex Y \sim N(\mu,S^2)$ and that the instruments record values of the form $latex X_i=Y + \sigma_i Z$.

To estimate $latex S^2$, a standard maximum likelihood approach could be used by considering the likelihood function:

$latex f(x_1,x_2,\dots,x_n)= \prod_i \frac{1}{\sqrt{2 \pi (\sigma_i^2+S^2)}} \, e^{-(x_i-\mu)^2 / (2(\sigma_i^2+S^2))}$,

from which the maximum likelihood estimator of $latex S^2$ is given by the solution to

$latex \sum_i \frac{(X_i- \mu)^2 - (\sigma_i^2 + S^2)}{(\sigma_i^2 + S^2)^2} = 0$.

Another, more naive, approach uses the following result

$latex E[\sum (X_i-\sum X_i/n)^2] = (1-1/n) \sum \sigma_i^2 + (n-1) S^2$

from which $latex \hat{S^2}= (\sum (X_i-\sum X_i/n)^2 – ( (1-1/n ) \sum(\sigma_i^2) ) ) / (n-1)$.

Here are three simulation scenarios where 200 values $latex X_i$ are taken from instruments of varying precision, or variance, $latex \sigma_i^2, i=1,2,\dots,200$, and where the variance of the phenomenon of interest is $latex S^2=1500$. In the first scenario the $latex \sigma_i^2$ are drawn from $latex [10,1500^2]$, in the second from $latex [10,1500^2 \times 3]$ and in the third from $latex [10,1500^2 \times 5]$. In each scenario the value of $latex S^2$ is estimated 1000 times, each time taking another 200 realisations of $latex X_i$. The values estimated via the maximum likelihood approach are plotted in blue, and those obtained by the alternative method in red. The true value of $latex S^2$ is given by the red dashed line across all plots.

First simulation scenario where $latex \sigma_i^2, i=1,2,…,200$ in $latex [10,1500^2]$. The values of  $latex \sigma_i^2$ plotted in the histogram to the right. The 1000 estimations of $latex S$ are shown by the blue (maximum likelihood) and red (alternative) histograms.

Second simulation scenario where $latex \sigma_i^2, i=1,2,…,200$ in $latex [10,1500^2 \times 3]$. The values of $latex \sigma_i^2$ plotted in the histogram to the right. The 1000 estimations of $latex S$ are shown by the blue (maximum likelihood) and red (alternative) histograms.

Third simulation scenario where $latex \sigma_i^2, i=1,2,…,200$ in $latex [10,1500^2 \times 5]$. The values of $latex \sigma_i^2$ plotted in the histogram to the right. The 1000 estimations of $latex S$ are shown by the blue (maximum likelihood) and red (alternative) histograms.

For recent advances in methods that deal with this kind of problem, you can look at:

Delaigle, A. and Hall, P. (2016), Methodology for non-parametric deconvolution when the error distribution is unknown. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78: 231–252. doi: 10.1111/rssb.12109

# Improving the precision of local community detection

Communities are everywhere. You can find groups of connected people on Facebook who went to university together, communities of people on YouTube that like the same band, sets of products on Amazon that tend to be purchased together, or modules of proteins that act in the same biological process in the human body. With the amount of data we can collect on these networks (Facebook friendships, protein interactions, etc.), you would think we can easily recover these groups. You would think wrong…

There are a lot of methods out there to find communities in networks. As a paper that compares community detection methods by Lancichinetti and Fortunato put it: “A thorough analysis of all existing methods would require years of work (during which many new methods would appear), and we do not believe it is feasible”. Thus, clearly we have not found the answer.

So what do people generally do to find these communities? Well… very different things. To explain a few of the approaches, it helps to break them down into broad categories such as global and local methods.

## Global Methods

Global community detection methods generally aim to optimize an objective function over the entire network. For example, I could decide to score a network partition with a function I will call “greatness”, G. I will define the greatness of a community as, for each node, the number of its neighbours that lie in the same community, summed over all nodes (the sum of the within-community node degrees). Or simply:

$latex G = \sum_{i,j} A_{ij} \, \delta(\sigma_i, \sigma_j)$

Here, A_ij is the adjacency matrix that contains a 1 where nodes i and j interact, and 0 otherwise. The sigmas in the equation correspond to the community identifiers of nodes i and j respectively, and thus the delta function is only 1 when sigma_i and sigma_j are equal, meaning nodes i and j are in the same community, and 0 otherwise.

If you now come to the conclusion that this is a pretty stupid thing to define, then you’ve understood the concept of a global method (alternatively… get rid of that negativity!). A global community detection method would now look for the partition of the network that maximizes G. In the case of the greatness defined here, everything will be put into the same community, as there’s no penalty for putting two nodes together that shouldn’t belong together. Adding a further node to a community can never decrease the greatness score!
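To make the toy objective concrete, here is a minimal Python sketch (the six-node graph is a hypothetical example) showing that lumping every node into one community can only raise G:

```python
import numpy as np

def greatness(A, sigma):
    """Sum of A_ij over all node pairs i, j assigned to the same community."""
    sigma = np.asarray(sigma)
    same = sigma[:, None] == sigma[None, :]  # delta(sigma_i, sigma_j)
    return int((A * same).sum())

# Toy network: two triangles joined by a single bridging edge (2, 3).
A = np.zeros((6, 6), dtype=int)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1

# The "sensible" partition into the two triangles vs. one big community:
split = greatness(A, [0, 0, 0, 1, 1, 1])   # bridging edge not counted
lumped = greatness(A, [0, 0, 0, 0, 0, 0])  # every edge counted
```

Here `lumped` beats `split`, because merging the triangles adds the bridging edge’s contribution without any penalty, which is exactly why this objective is a bad one.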

If you want a real overview of these and other methods, check out this 100-page review article.

## Local Methods

In contrast to global methods, which attempt to gather information on the whole network, local community detection methods dive into the network and try to optimize the community partitioning from there. On the one hand, this means they can be very fast at assigning nodes to communities, as they take less information into account. On the other hand, using less information in this assignment means they can be less precise.

An example of a local method is link clustering, which I described in an earlier post. Link clustering assesses the similarity of two edges that share a node, rather than comparing the nodes themselves. The similarity of these edges is calculated based on how similar the surroundings of the nodes on the endpoints are. Using this score, all edges that are more similar than a cutoff value, S, can be clustered together to define the edge-communities at resolution S. This partitioning translates to nodes by taking the nodes that the edges in the edge-communities connect.
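A small sketch of the idea, assuming the commonly used Jaccard index of inclusive neighbourhoods as the similarity measure for two edges sharing a node (the toy graph below is hypothetical):

```python
def edge_similarity(graph, i, j, k):
    """Similarity of edges (i, k) and (j, k) sharing node k: the Jaccard
    index of the inclusive neighbourhoods of the non-shared nodes i and j."""
    ni = graph[i] | {i}  # inclusive neighbourhood: node i plus its neighbours
    nj = graph[j] | {j}
    return len(ni & nj) / len(ni | nj)

# Adjacency sets for a small graph: a dense 4-clique with a pendant node 4.
graph = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}

s_close = edge_similarity(graph, 0, 1, 3)  # edges (0,3),(1,3): similar surroundings
s_far = edge_similarity(graph, 0, 4, 3)    # edges (0,3),(4,3): node 4 is peripheral
```

Edges embedded in the clique score much higher than the edge out to the pendant node, so clustering edges above a cutoff S groups the clique’s edges together while leaving the pendant edge out.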

## Improving local community detection

Link clustering is a pretty nice and straightforward example of how you can successfully find communities. However, the authors admit it works better on dense than on sparse networks. This is an expected side-effect of taking into account less information to be faster as mentioned above. So you can improve it with a pretty simple idea: how much more information can I take into account without basically looking at the whole network?

Local community detection methods such as link clustering are generally extended by including more information. One approach is to include information other than network topology in the local clustering process (like this). This approach requires information to be present apart from the network. In my work, I like using orthogonal data sets to validate community detection, rather than integrate it into the process and be stuck without enough of an idea if it actually worked. So, instead I’m trying to include more of the local surroundings of the node to increase the sensitivity of the similarity measure between edges. At some point, this will blow up the complexity of the problem and turn it into a global method. So I’m now walking a line between precision and speed. A line that everyone always seems to be walking. On this line, you can’t really win… but maybe it is possible to optimize the trade-off to suit my purposes. And besides… longer algorithm run times give me more time for lunch breaks ;).

# Conformational Variation of Protein Loops

Something many structural biologists (including us here in OPIG!) are guilty of is treating proteins as static, rigid structures. In reality, proteins are dynamic molecules, constantly transitioning from one conformation to another. Proteins are therefore more accurately represented by an ensemble of structures instead of just one.

In my research, I focus on loop structures, the regions of a protein that connect elements of secondary structure (α-helices and β-sheets). There are many examples in the PDB of proteins with identical sequences, but whose loops have different structures. In many cases, a protein’s function depends on the ability of its loops to adopt different conformations. For example, triosephosphate isomerase, which is an important enzyme in the glycolysis pathway, changes conformation upon ligand binding, shielding the active site from solvent and stabilising the intermediate compound so that catalysis can occur efficiently. Conformational variability helps triosephosphate isomerase to be what is known as a ‘perfect enzyme’; catalysis is limited only by the diffusion rate of the substrate.

Structure of the triosephosphate isomerase enzyme. When the substrate binds, a loop changes from an ‘open’ conformation (pink, PDB entry 1TPD) to a ‘closed’ one (green, 1TRD), which prevents solvent access to the active site and stabilises the intermediate compound of the reaction.

An interesting example, especially for some of us here at OPIG, is the antibody SPE7. SPE7 is multispecific, meaning it is able to bind to multiple unrelated antigens. It achieves this through conformational diversity. Four binding site conformations have been found, two of which can be observed in its unbound state in equilibrium – one with a flat binding site, and another with a deep, narrow binding site [1].

SPE7; an antibody that exists as two different structures in equilibrium – one with a shallow binding site (left, blue, PDB code 1OAQ) and one with a deep, narrow cleft (right, green, PDB 1OCW). Complementarity-determining regions are coloured in each case.

So when you’re dealing with crystal structures, beware! X-ray structures are averages – each atom position is an average of its position across all unit cells. In addition, thanks to factors such as crystal packing, the conformation that we see may not be representative of the protein in solution. The examples above demonstrate that the sequence/structure relationship is not as clear cut as we are often led to believe. It is important to consider dynamics and conformational diversity, which may be required for protein function. Always bear in mind that the static appearance of an X-ray structure is not the reality!

[1] James, L C, Roversi, P and Tawfik, D S. Antibody Multispecificity Mediated by Conformational Diversity. Science (2003), 299, 1362-1367.