Remove all LaTeX generated files

I am going to leave this here and bookmark it, because I am fed up with looking this up every time, not finding it, and having to `history | fgrep rm`. To be used if you want to delete all LaTeX-generated (and pdfLaTeX-generated) files.

rm *.aux *.out *.toc *.log *.synctex.gz *.bbl *.blg *.pdf

Use at your own risk!

 

Progress report: Molecular Dynamics simulations of TCRpMHC

Last time I gave a presentation about different papers describing Molecular Dynamics simulations of TCRpMHC (http://blopig.com/blog/?p=648&preview=true). This time I extended this to more of my own work. The focus of this talk was the motivation for choosing a certain system and the requirements it must fulfil for reliable conclusions. In short they are:

1) A large number (>100) of experimental values (e.g. some kind of binding, immunogenicity, etc.) for a certain TCRpMHC system. These values should be obtained:
– by the same group
– with the same HLA/MHC batches
– under the same binding conditions
– in the ideal case, reported in the same manuscript

The values should NOT come from a database, since different experimental groups arrive at extremely different results for the same complexes, as several examples show, e.g.:
The binding affinity (IC50) values of PKYVKQNTLKLAT to DRB1*01:01 range from 1 nM to 600 nM across IEDB entries.

2) Structural data exactly matching the systematic experimental data mentioned above. If multiple structures are involved they should:

– be published by the same group (in the ideal case, even published together with the above-mentioned data)
– be contemporary

 

Data which fulfil the above criteria are quite hard to find, since most biologists are mainly interested in fancy but rather anecdotal examples, and rarely in systematic data production.

Once one has found such a system, a large number of Molecular Dynamics simulations can be performed, which will yield systematic differences rather than differences originating from random events or data bias.

 

 

Django for scientific applications

In my current work I am developing a cheminformatics tool using structural and activity data to investigate protein-ligand binding. I have only ever properly used (and love) Python, and I listen to Saulo, so I decided to use Django to develop my application. I didn’t understand what it was or why it might be useful before I started using it, but below I discuss a few of the features that I think have been useful and might encourage others to use it.

Firstly I will outline how Django works. I wanted to download all the PDB structures for CDK2 and store the information in a data structure that is robust and easily used. We have a Target and a Protein. A Target is associated with a particular UniProt accession. Cyclin-dependent kinase 2 (CDK2) is a Target. A Protein is a set of 3D coordinates, so 1AQ1 is a Protein.

class Target(models.Model):
"""A Django model to define a given protein target"""
    UniProt = models.CharField(max_length=20,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    Title = models.CharField(max_length=10)

In the above Target model I have three different fields. The first field denotes the UniProt accession for the Target and is “unique”. This means that only one Target can have any given UniProt accession in my data structure. If I try to add another with the same value in the UniProt field it will throw an exception. The second field denotes the time and date that the model was created. This means I can check back to when the target was created. The third is the Title I would like to use for this, for example CDK2.

I can then make a new Target object by:

new_target = Target()
new_target.Title = "CDK2"
new_target.UniProt = "P24941"

and save it to the database by:

new_target.save() # Django takes care of the required SQL

The next model is for the Protein molecules:

class Protein(models.Model):
    """A Django model to define a given protein"""
    Code = models.CharField(max_length=6,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    TargetID = models.ForeignKey(Target)
    Apo = models.BooleanField()
    PDBInfo = models.FileField(upload_to='pdb')

The model contains the PDB Code, e.g. 1AQ1, and the date it was added to the database. It also contains a foreign key relating it to its Target, and a boolean indicating whether the structure is apo or holo. Finally there is a file field relating this entry to the appropriate file path where the PDB information is stored.
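To round this off, here is a minimal sketch of how a Protein entry might be created and linked to its Target (the PDB file path is a placeholder, and exact file handling will depend on your storage settings):

from django.core.files import File

cdk2 = Target.objects.get(UniProt="P24941")  # fetch the Target created earlier
new_prot = Protein()
new_prot.Code = "1aq1"
new_prot.TargetID = cdk2
new_prot.Apo = False  # 1AQ1 has a bound ligand, so it is a holo structure
with open("/path/to/1aq1.pdb") as pdb_file:  # placeholder path
    new_prot.PDBInfo.save("1aq1.pdb", File(pdb_file))  # stores the file and saves the model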

Once the data has been added to the database, Django then deals with all SQL queries from the database:

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
CDK2_prots = Protein.objects.filter(TargetID__Title="CDK2") # All PDB entries associated to CDK2, as a query set, behaving similarly to a list
CDK2_list = [x for x in CDK2_prots] # Now exactly like a list

The “__” in the above query allows one to span the foreign key relationship, so it is searching for the Title of the Target not the Title of the Protein. Finally I can then access the PDB files for each of these proteins.

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
print my_prot.Code # prints "1aq1"
# my_prot.PDBInfo has the behaviour of a file handle
pdb_lines = my_prot.PDBInfo.readlines() # Reads the lines of the file

There, you’ve made a queryable database, where Django deals with all the hard stuff and everything is native to Python. Obviously in this example it might not be so difficult to imagine alternative ways of creating the same thing using directory structures, but as the structure of your data becomes more complex, Django can be easily adapted, and as it grows it utilises the speed advantages of modern databases.

What is a kink?

If you ask someone who works with membrane protein structures what they think a helix kink is, you will probably get a fairly vague answer. They would probably tell you something like a kink is a point in the helix where there is a sharp change in the direction of the helix axis. Like the helix on the right in the picture.


Two example helices


They might also tell you that they frequently (or always) involve Proline, and that they feature the absence of some of the backbone hydrogen bonds that you would otherwise expect to observe in a helix. But that is about as far as the consensus goes.

There are a number of published methods to identify kinks in helices (ProKink, Helanal, MC-Helan, and TMKink, to name a few). The details of identifying kinks (and hence the definition of a kink) differ a lot between the methods. That would be fine if they all identified the same helices as having kinks, and the same positions in each helix as kinked, but they do not. Even the proportion of helices that are kinked differs a lot – anything from 6% to 60%. It would be easy if all α-helices were like the two in the figure – clearly kinked or straight. Generally the methods agree on these ‘easy’ helices. But there are lots of helices that are difficult to classify.

Why do we want to computationally identify them? Well, kinks are functionally important in membrane proteins. Membrane proteins are both medicinally important and hard to determine the structure of experimentally. So, modelling their structures is a useful thing to do, and their structures include kinks. But kinks are also a structural motif: if we can write down what we think a kink is, use that definition to identify kinks, and then use information from the kinks we identified to (reliably) predict the existence of other kinks, we will know that we fully understand the motif.

There are a remarkably large number of perfectly sensible ways to identify a sharp change in the axis of an α-helix, given the coordinates of the backbone atoms.
The published methods (mentioned above) are a bit cumbersome (they require you to create input files, and parse at least one, if not many, output files), so in my experience people tend to make their own methods. (That is not just me with Kink Finder; many of the people who need to identify kinks that I have spoken to at conferences do not use a published method.)

A short summary of ways to identify kinks:

  1. Fit axes to short sections of helix (4-8 residues). Give each residue an angle calculated between the axis of the section before it and the axis of the section after it. If the largest angle in the helix is greater than a threshold, annotate the helix as kinked, at the residue with the largest angle.
  2. Methods similar to (1), but using a measure other than an angle. For example, the curvature of a line fitted to the helix, or the distance between the ith and (i+4)th Cα atoms.
  3. Identify sections of the helix that are ‘straight’. Where a helix contains more than one such section, it is kinked.
  4. Look at them, and classify by eye.

Even for (1), there are lots of ways to calculate a helix axis. You can use least-squares (very sensitive to the number of residues used in the fit), fitting to a cylinder (better, but slower), or a vector method using four Cαs. Similarly for (2), you could fit a spline, or just a polynomial. This is before you decide on (e.g.) how many residues you are going to fit axes to (6? 7? 8?), or what smoothing parameter to use in the spline fit, or what order of polynomial.
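As a rough illustration of approach (1) – and emphatically not a re-implementation of any of the published methods above – the following sketch fits a least-squares axis to six-residue windows of Cα coordinates and reports the largest angle between consecutive windows; the 20-degree threshold at the end is an arbitrary choice for illustration:

import numpy as np

def axis_of(ca_coords):
    """Least-squares axis of a short helix section: the first principal
    component of an (n x 3) array of C-alpha coordinates."""
    centred = ca_coords - ca_coords.mean(axis=0)
    _, _, vt = np.linalg.svd(centred)
    axis = vt[0]
    if np.dot(axis, ca_coords[-1] - ca_coords[0]) < 0:  # point axis N- to C-terminal
        axis = -axis
    return axis

def largest_kink_angle(ca_coords, window=6):
    """Largest angle (degrees) between the axes of the sections before
    and after each residue, plus the residue index where it occurs."""
    best_angle, best_residue = 0.0, None
    for i in range(window, len(ca_coords) - window):
        before = axis_of(ca_coords[i - window:i])
        after = axis_of(ca_coords[i:i + window])
        cosine = np.clip(np.dot(before, after), -1.0, 1.0)
        angle = np.degrees(np.arccos(cosine))
        if angle > best_angle:
            best_angle, best_residue = angle, i
    return best_angle, best_residue

# angle, residue = largest_kink_angle(helix_ca)  # helix_ca: (n, 3) array of Ca coordinates
# kinked = angle > 20.0                          # threshold chosen arbitrarily here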


How should we classify this helix?

The point is, there are lots of ways to do it, but there is no consensus definition. You can use one method, or two, or more, but they will give you different answers. Because we have no concise definition, it is hard to say which helix classification is ‘right’ and which one is ‘wrong’. Take this helix: it is not perfectly straight, but is it bent enough to be kinked? Or is the change in axis spread over a number of turns, making it curved rather than kinked?

There are a few ways round this problem, where your computer program is struggling to ‘correctly’ classify helices. Rather than using a computer, you could manually curate a set, which has been done. However, there are a few issues here. First, the curators’ definition of a kink was a bit vague, and certainly difficult to turn into a computer program. Second, humans are very good at seeing patterns, even when they are not really there. Third, the end aim, at least for me, is to incorporate this into structure prediction, where a computational formulation of a kink is necessary. Another solution is to accept that your program is deficient, put the borderline (i.e. difficult) cases into a third group, and remove them from your analysis. This does help to improve your signal, but is a little unsatisfactory.

To conclude, there are not two types of helices; there is more of a spectrum, from very straight, through a bit kinked, to very kinked. Classifying them is a hard problem, with lots of different approaches available. And because the methods give different answers (particularly in identifying the position of the kink in the helix), and do not indicate the same causes of kinks, there is still work to be done.

3035 – OPIG Beer & Cycling

Other than my customary Saturday morning hangover, today I woke up with aching limbs – so I am feeling rather poor indeed.  The source of this pain, other than the five and a half pints of premium quality lager, is the OPIG beer and cycling tour of Oxford (aptly named “Le Tour de Farce”).  What can I say, over here they make you work for your beer.



We started off at the Medawar building, where we sit on weekdays and occasionally on weekends. Then we cycled to The Fishes, The White Hart, The Trout (the Carne pizza is a must), yet another White Hart, and the vegetarian and vegan Gardener Arms (I know, I know – I was drunk and they forced me in here).  For those not familiar with Oxford custom – these are some of the most beautiful (and pricey) pubs this land has to offer.


OPIG members love a laugh and their beer on the cycling trip

The bike hike (devised by Charlotte) was 9.535 miles.  On 5.5 pints of beer (Nick spilt his pint, like an amateur – so I gave him half of mine), that means I run on 13.86 miles per gallon  (9.535 / 0.6875).  So I roughly have the fuel economy of a 2013 Ferrari 458 Spider.  Say what you want about these blog posts, but you cannot say I am not thorough about the research which goes on behind them.

9.5 miles in Malta

This is how far yesterday’s bike ride would have got me in my country (Malta).

This morning I have also noticed my calves are an inch thicker in diameter, which of course means I will go and throw away all my socks and make that long overdue trip to Bicester Village.  Perhaps I will cycle there.

A very long introductory post about protein structure prediction

If you are a protein informatician, bioinformatician, biochemist, biologist or simply a person well informed about science, you have probably heard about protein structure prediction. If that is the case, you might be wondering what all the fuss is about, right? If you have never heard those terms before, don’t panic! You are about to find out what protein structure prediction is all about!

Based on my group meeting presentation last Wednesday, this blog entry will discuss why protein structure prediction is important and the potential limitations of existing methods. I will also discuss how the quality of the input may be a source of inaccuracy in existing software.

First, let us remember a little biology: our genetic code encodes the inner workings of a complicated cellular machinery, tightly regulated by other (macro)molecules such as proteins and RNAs. These two types of macromolecules are the agents that carry out the set of instructions encoded in our DNA. Basically, RNAs and proteins are involved in a series of processes that regulate cellular function and control how the genetic code is accessed and used.

For that reason, a huge chunk of genomic data is not that useful if considered on its own. Scientists around the globe have invested millions of moneys and a huge chunk of time in order to amass piles and piles of genome sequencing data. To be fair, this whole “gotta sequence ’em all” mania did not provide us with the fundamental answers everybody was hoping for. Cracking the genetic code was like watching an episode of Lost: we were left with more questions than answers. We got a very complicated map that we can’t really understand just yet.

For that reason, I feel obliged to justify myself: protein structures ARE useful. If we know a protein structure, we can formulate a very educated guess about that protein’s function. Combine that with empirical data (e.g. where and when the protein is expressed) and it can help us unveil a lot of info about the protein’s role in cellular processes. Basically, it can answer some of the questions about the (genomic) map. If only we could do that with Lost…

There is also evidence that knowing a protein’s structure can help us design specific drugs to target and inhibit that protein. Although the evidence of such biomedical application is sparse, I believe that with development of the field, there is a trend for protein structures to become more and more important in drug discovery protocols.

Still, if we look at the number of known genome sequences and known protein structures, and at the growth of those figures over the past decade, we face a drastic scenario:

Growth of Sequences vs Structures


There is a tendency for the gap between the number of protein sequences and protein structures to increase. Hence, we are getting more and more questions and little to no answers. Observe how the green line (the protein sequences associated with a known or predicted function) is very close to the red line (the number of known protein structures). However, there is a growing gap between the red and the blue line (the number of protein sequences). Source: http://gorbi.irb.hr/en/method/growth-of-sequence-databases/

Well, gathering protein structure data is just as important as, if not more important than, gathering sequence data. This motivated the creation of Structural Genomics Consortia (SGC), facilities that specialise in solving protein structures.

I am sorry to tell you that this is all old news. We have known this for years. Nonetheless, the graph above hasn’t changed. Why? The cost limitations and the experimental difficulties associated with protein structure determination are holding us back. Solving protein structures in the lab is hard and time consuming and we are far from being as efficient at structure determination as we are at genome sequencing.

There is a possible solution to the problem: you start with a protein sequence (a sequential amino acid list) and you try to predict its structure. This is known as protein structure prediction or protein structure modelling. Well, we have a limited number of building blocks (20) and a good understanding of their physicochemical properties, so it shouldn’t be that hard, right?

Unfortunately, modelling protein structure is not as simple as calculating how fast a block slides on an inclined plane. Predicting protein structure from sequence is a very hard problem indeed! It has troubled a plethora of minds throughout the past decades, making people lose many nights of sleep (I can vouch for that).

We can attribute that to two major limitations:

1- There are vastly many ways to combine the 20 “blocks” into a sequence of hundreds of amino acids, and each amino acid can itself assume a range of conformations. We are looking at a massive combinatorial problem. The conformational space (the space of valid conformations a protein with a given sequence can assume) is so large that if you could check a single conformation every nanosecond, it would still take longer than the age of the universe to probe all possible conformations (see the back-of-the-envelope sketch after point 2).

2- Our physics (and our statistics) are inaccurate. We perform so many approximations in order to make the calculations feasible with current computers that we end up with very inaccurate models.
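Going back to limitation 1, here is a back-of-the-envelope version of that estimate (the three conformations per residue and the 150-residue chain length are illustrative assumptions, not measured figures):

# Levinthal-style estimate of the size of the conformational space.
residues = 150                    # length of a modest protein chain (assumption)
states_per_residue = 3            # rough number of conformations each residue can adopt (assumption)
conformations = states_per_residue ** residues   # ~10^71 possible conformations

rate = 1e9                        # one conformation per nanosecond = 1e9 per second
seconds_needed = conformations / rate
age_of_universe = 4.3e17          # seconds, roughly 13.8 billion years
print(seconds_needed / age_of_universe)   # roughly 10^45 times the age of the universe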

Ok! So now you should know what protein structure prediction is, why it is important and, more importantly, why it is such a hard problem to solve. I am going to finish off by giving you a brief overview of the two most common approaches to protein structure prediction: template-based modelling (also known as homology modelling) and de novo structure prediction.

There is a general understanding that if two proteins have very similar sequences (namely, if they are homologs), then they will have similar structures. So, we can use known structures of homologs as templates to predict other structures. This is known as homology modelling.

One can do a lot of fancy talk to justify why this works. There is the evolutionary argument: “selective pressure acts on the phenotype level (which can encompass a protein structure) rather than the genotype level. Hence protein structures tend to be more conserved than sequence. For that reason and considering that sequence alone is enough to determine structure, similar sequences will have even more similar structures.”

One can also formulate some sort of physics argument: “a similar aminoacid composition will lead to a similar behaviour of the interacting forces that keep the protein structure packed together. Furthermore, the energy minimum where a certain protein structure sits is so stable that it would take quite a lot of changes in the sequence to disturb that minimum energy conformation drastically.”

Probably the best argument in favour of homology modelling is that it works somewhat well. Of course, the accuracy of the models has a strong dependency on the sequence similarity, but for proteins with more than 40% identity, we can use this method in order to obtain good results.

This raises another issue: what if we can’t find a homolog with known structure? How can we model our templateless protein sequence then? Well, turns out that if we group proteins together into families based on their sequence similarity, more than half of the families would not have a member with known structure. [This data was obtained by looking at the representativeness of the Pfam (a protein family database) on the PDB (a protein structure database).]

Ergo, for a majority of cases we have to perform predictions from scratch (known as free modelling or de novo modelling).

Well, not necessarily from scratch. There is a specific approach to free modelling where we can build our models using existing knowledge. We can use chunks of protein, contiguous fragments extracted from known structures, to generate models. This is known as a fragment-based approach to de novo protein structure prediction. And that is one big name!

One can think of this as a small scale homology modelling, where both the physics and evolutionary arguments should still hold true to some degree. And how do we do? Can we generate good models? We perform appallingly! Accuracies are too low to generate any useful knowledge in a majority of cases. The problem with the rare cases when you get it right is that you have no means to know if you actually got the right answer.

The poor quality of the results can be attributed to the two major limitations discussed above. Yet something else might be at play. In homology modelling, if you use a bad template, you will most certainly get a bad model. In a similar way, using a bad set of fragments will lead you to a very poor final model.

Considering we already have the other two big issues (the size of the conformational space and the accuracy of current potentials) to worry about, we should aim to use the best fragment library we possibly can. This has been the recent focus of my work: an attempt to make a small contribution towards solving such a hard problem.

I would love to detail my work on finding better fragments here, but I believe this post is already far too long for anyone to actually endure it and read it until the end. So, congratulations if you made it through!

ISMB/ECCB Conference 2013 (Berlin)

It’s that time of the year again… when an intrepid group of OPIGlets trundle back tired but happy from another successful conference (this time it was ISMB/ECCB and its satellite conference 3Dsig in Berlin) armed with their favourite titbits from the presented work. This blog post is a mashup of some of our highlights as presented at the last group meeting.


Post-schnitzel and out and about in Berlin!

Definitely one of the best things for me was getting the chance to hear Sir Tom Blundell (our very own academic grandfather… Charlotte’s supervisor) give the keynote at 3Dsig, talking about everything from the structure of insulin to his deep, underlying love of jazz. Here are some more of our favourite things…

Empirical contact potentials derived from binding free energy changes upon mutation
(poster presentation by Iain H. Moal and Juan Fernández-Recio)

Chosen by Jinwoo Leem

I was impressed by Moal (et al.)’s poster on predicting protein-protein binding affinities (in fact, it won the poster prize at 3D-Sig!). The poster describes a statistical potential that considers the number of mutations in a protein, and the type of interatomic contacts. Two variants of the potential were made; one for considering all atoms (atomic potential), and one considering residue side chains, represented as a centroid atom (residue potential). Ultimately, the energy change is represented as:

[Equation image: the energy change written in terms of the contact-count matrix N and the contact-type potential vector P]

where N is the matrix of interatomic contacts between atoms i,j and P is a vector of contact types. Using weighted least-squares to minimise the residuals, r, the equation was used to predict affinity (ΔG) and affinity changes following mutations (ΔΔG).
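In other words, the fit is an ordinary weighted least-squares problem. A minimal sketch of that step follows; the matrix shapes and the square-root weighting trick are assumptions for illustration, not the authors’ exact procedure:

import numpy as np

def fit_contact_potential(N, dG, weights):
    """Weighted least-squares fit of a contact-type potential vector P such
    that N.dot(P) approximates the observed binding free energies dG.
    N: (complexes x contact types) count matrix; weights: one per complex."""
    w = np.sqrt(weights)
    P, _, _, _ = np.linalg.lstsq(N * w[:, None], dG * w, rcond=None)
    return P

# predicted_dG = N_new.dot(P)   # apply the fitted potential to unseen complexes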

[Figure: predicted vs. experimental ΔΔG (top) and ΔG (bottom)]

As we can see in the top two graphs, the model shows decent performance for predicting ΔΔG of enzyme-inhibitor interactions, i.e. the model can indicate how a mutation affects binding affinities. Having said this, the ΔΔG predictions for Ab-Ag interactions were poor (Pearson’s correlation = 0.5-0.6).

Moreover, when the same potential was used to predict ΔG (bottom two graphs), the correlations were even worse. In fact, for flexible protein pairs, i.e. receptor-ligand pairs whose interface RMSD > 1.0 Å, the correlation drops to as low as 0.02.

Although the results are disappointing with respect to ΔG prediction, the model raises two interesting points. First, this is one of the few scoring functions that are specifically designed to predict affinity, rather than giving an arbitrary score for low RMSD. In addition, this model reiterates the challenges in predicting Ab-Ag interactions. The solution for the latter point is not yet clear, but it may be of interest to re-train the model specifically with Ab-Ag complexes, and see if the model’s performance improves!

Predicting protein contact map using evolutionary and physical constraints by integer programming
(paper presentation by Zhiyong Wang and Jinbo Xu)

Chosen by Saulo de Oliveira

Last week, I decided to present a quick overview of a paper presentation I attended during ISMB 2013.

The title of the presentation was “Predicting protein contact map using evolutionary and physical constraints by integer programming”, based on a paper of the same name.

Contact prediction (or evolutionary constraint prediction, a term I am much more fond of) was a trendy topic at both 3Dsig (2013) and ISMB (2013), with several presentations and posters on the subject.

In this particular presentation, Zhiyong Wang and Jinbo Xu described a new method to identify evolutionary constraints. What set their talk and their work apart was that they approached the problem from a different angle: their aim was to predict contacts when there is a low number of sequences in the multiple sequence alignment (refer to previous posts on the blog for an introduction to contact prediction).

They proposed a combination of machine learning and integer programming (similar to linear programming, again a topic we discussed previously here) to perform their predictions.

The machine learning features did not present any innovation; they were quite standard in the field, such as mutation rates in PSI-BLAST profiles and the Mutual Information (MI). The output of the Random Forest algorithm was then employed to formulate constraints in a linear problem, and these constraints were used to enforce physical properties of proteins, based mostly on our understanding of secondary structure.
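For readers unfamiliar with the MI feature, here is a minimal sketch of mutual information between two alignment columns (a generic textbook formulation, not the authors’ exact implementation):

import math
from collections import Counter

def mutual_information(col_i, col_j):
    """MI between two MSA columns, each given as a string with one amino
    acid per aligned sequence (the two columns must have the same length)."""
    n = float(len(col_i))
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi

# Column pairs with correlated substitution patterns score highly, hinting at
# co-evolution and hence at a possible contact between the two positions.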

Results seemed positive on a random test set (CASP10) and two other test sets. By positive, I mean there was an improvement over the current state of the art, especially for proteins with 10-1000 sequences in the MSA. Still, their precision was around 30-40% for the top L/10 predictions (where L is the protein length). Further improvements are still necessary before we can apply these evolutionary constraints to improve protein structure prediction.

Evolution of drug resistance: structural models
(presentation by Maria Safi)

Chosen by Hannah Edwards

I found this talk by Maria Safi (which won the prize for best non-keynote presentation at 3Dsig) to describe a really interesting method, despite my complete lack of background knowledge in the area (what are conferences for but to expose you to new ideas, right?).

Their idea was to produce a computationally viable method for identifying drug resistance in a protein’s mutant space. Drugs work by binding to their target protein in such a way as to inhibit its native function. If the protein mutates so as to maintain its native function but impair its binding to the drug, it acquires resistance. The problem is, even within a reasonable distance of the native sequence, a protein’s mutant space is huge, and it is by no means trivial to test for maintenance of function and binding energy.

The group’s solution was to recognise that the vast majority of mutant space would not be of interest. As such, they send their candidate mutants through a two-pass search: the first pass is a quick and easy algorithm to swiftly eliminate the dead-end mutants (those that either are not resistant to drug binding or do not maintain their native function), and the second is a more biochemically accurate yet computationally expensive algorithm applied to the shortlist identified during the first pass.

The first algorithm is based on restricted dead-end elimination, which aims to minimise a simple energy potential based on the protein’s structural stability and its binding energy to the drug. The algorithm keeps the backbone structure constant, but by varying the side-chain conformations, different mutants result in different energy potentials. A mutation at residue r can then be eliminated if an alternative mutation at r will always result in a lower energy potential.
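For intuition, the elimination test in that first pass works roughly like the following sketch (a generic Goldstein-style dead-end elimination criterion over a pairwise energy model, not the authors’ restricted DEE implementation; the function names are illustrative):

def is_dead_ending(r, cand, alternatives, E_self, E_pair, positions, options):
    """Choice `cand` (a mutation/rotamer) at position r can be discarded if
    some alternative is lower in energy no matter what the rest of the
    protein does.
    E_self(r, x):       internal + backbone energy of choice x at position r
    E_pair(r, x, j, y): interaction energy between x at r and y at position j"""
    for alt in alternatives:
        diff = E_self(r, cand) - E_self(r, alt)
        for j in positions:
            if j == r:
                continue
            # add, for each other position, the neighbour choice that is most
            # favourable to `cand` relative to `alt` (the candidate's best case)
            diff += min(E_pair(r, cand, j, y) - E_pair(r, alt, j, y)
                        for y in options[j])
        if diff > 0.0:   # even in its best case, `cand` is beaten by `alt`
            return True
    return False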

The second algorithm is based on the more sophisticated MM-PBSA methodology, which combines molecular mechanics with Poisson-Boltzmann Surface Area calculations to estimate the free energy of the compound. This final run identifies the candidate mutants.

A significant strength of their method is that it requires only the crystal structures of the drug and target protein. As a purely structural model it eliminates the need for large amounts of training data, which, for newly emerging diseases and drugs, is often impossible to obtain.

The main focus of Maria’s talk however was using these energy potentials to predict evolutionary pathways from a wild-type protein to a resistant candidate. By treating evolution as a random walk through mutant space, weighted by the energy potentials, and assuming selection pressure of resistance, they were able to computationally simulate evolutionary scenarios.

For example, Maria focussed on the ritonavir-HIV protease complex to illustrate this method. The majority of the mutants with resistance to ritonavir which have been observed in nature were predicted by the method. For the candidates that were predicted but have not been seen, further elucidation could be found from the simulated evolutionary pathways: around 60% of these candidates were not accessible under the evolutionary model.

Sequence comes to the Structural Rescue: Identifying Relevant Protein Interfaces in Crystal Structures
(presentation by Jose M. Duarte)

Chosen by Henry Wilman

Jose Duarte presented a tool, EPPIC, which identifies and classifies protein interfaces from PDB structures. The talk was titled ‘Sequence comes to the Structural Rescue: Identifying Relevant Protein Interfaces in Crystal Structures’, and follows from their publication: Duarte JM, Srebniak A, Schärer MA, Capitani G. Protein interface classification by evolutionary analysis. BMC Bioinformatics. 2012 Dec 22.

As the title suggests, this uses both structural and sequence information to classify protein contacts as biological or crystal. There is a webserver, and a downloadable version. There are a number of methods that exist to classify interfaces, and this differs in a few ways.

The previous methods typically rely on the area of the interface. As you see in the paper, even the benchmark sets used to test the other methods are biased such that biological contacts have much greater areas than crystal contacts. When the authors constructed a set where the contact area was similar, they found the previous methods performed generally poorly. However, there are a number of ways that you can define the interface or contact area, and specifically what people call ‘core residues’ of the interface. They found one study performed much better on their training set than the others. This defined core residues as ones that lost the majority of their solvent accessible surface area on binding to the interface. A simple cut off of >= 6 core residues at an interface produced a good classification.

In addition to this, they used sequence information. We know that interfaces are important, and mutations at interface residues are often deleterious. So, for a biological interface, we would expect the residues to be better conserved than non-interacting surface residues. The authors used sequence entropy as a measure of conservation. They calculated this by collecting homologous sequences with PSI-BLAST and aligning them using Clustal Omega. For each position in the alignment, if x is the occupancy frequency of a given amino acid, the sequence entropy is given by minus the sum over all amino acids of x log(x). (They actually use a reduced alphabet for this, to avoid penalising mutations to similar amino acids.) They then compare the entropy of the ‘core’ residues in the interface to those on the surface of the protein, and to those on the periphery of the interface. If the core residues have lower entropy, then the contact is classed as biological. There are simple thresholds for both of these comparisons.
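A minimal sketch of that entropy calculation for a single alignment column (using the full 20-letter alphabet rather than the reduced alphabet the authors use, and simply ignoring gaps):

import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy of one alignment column, given as a string with one
    amino acid (or '-') per homologous sequence."""
    residues = [aa for aa in column if aa != "-"]
    counts = Counter(residues)
    total = float(len(residues))
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Interface cores whose average entropy is lower than that of the protein
# surface look conserved, i.e. under selection, and hence biological.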

They have three metrics – one structural (number of core residues), and two sequence-based (entropy of core residues vs. peripheral residues, and entropy of core residues vs. surface residues). They classify based on a majority vote of the three methods. If there is an insufficient number of homologous sequences (i.e. fewer than 8), then they ignore the sequence scores, and classify using the structure only.

So why do we care about protein interfaces? Lots of us work with experimental protein structures. Many of these come from X-ray crystallography experiments. This means that when the structural information is captured, the protein is not isolated – instead it is packed against many other copies of itself. A bit like a brick in a wall – a wall many rows of bricks deep. So our protein is in contact with many others. Some of these contacts occur within the natural environment of the cell, others are a result of the crystal packing structure.
Now, protein interfaces are important. ‘Why?’, I hear you ask. Interfaces are how proteins interact with each other, and with other molecules. Some interfaces are large, some small, some are involved in transient interactions, others in permanent ones. Many diseases occur due to amino acid mutations at these interfaces. These change how the protein functions, which can cause bad things to happen. Similarly, if we can learn how to interact with an interface, we are (theoretically) able to alter the function with some sort of drug, and cause good things (whilst avoiding bad things).

So, this raises a few questions for people who study protein crystal structures. Things like, which bits of my protein interact? Do these interactions happen in a cell? Is my structure somehow distorted by the crystal packing? This EPPIC tool gives us one way to answer these.


Congratulations on reaching the end of this blog post… as a reward, the epic Brandenburg Gate (taken by Jin)

Every Protein needs a Friend – Community Detection in Protein Interaction Networks

To make the OPIG soup, which has tasted of antibodies a lot lately, a little more diverse, I will try to spice things up with a dash of protein interaction networks, a pinch of community detection and a shot of functional similarity evaluation. I hope it remains edible!

 

In the 10 weeks I have spent at OPIG, my main focus has been on protein interaction networks, or more specifically, on this network:


View of the largest connected component of the HINT binary physical interaction network. Nodes represent proteins and edges are protein interactions.

Viewing this image, a popular German phrase comes to mind, which badly translated means: “As you see, you see nothing”. However, trying to “see” something in this is what I’ve been trying to do. And, as it turns out, I’m not the only one.

If we had a data set which says exactly which protein interacts with which other ones, then surely all biological pathway information must be incorporated in this data, and we should be able to cluster it into smaller modules or communities, which represent a biological function. This Gedankenexperiment is the theory which underlies my approach to these networks.

In reality, however, we don’t have this perfect data set. Protein interaction networks are very noisy, with high estimated false positive and false negative rates for interactions, yet community detection algorithms have still been shown to be successful in outputting meaningful partitions of the network into communities. In this context, “meaningful” refers to communities which group together proteins that have a similar biological function.
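To give a flavour of what such a pipeline looks like in practice, here is a minimal sketch using networkx and its greedy modularity maximisation (the edge-list filename is a placeholder, and this is only one of many community detection algorithms, not necessarily the one I used):

import networkx as nx
from networkx.algorithms import community

# Build the network from a two-column edge list of interacting proteins
# ("hint_edges.txt" is a placeholder filename).
network = nx.read_edgelist("hint_edges.txt")

# Restrict to the largest connected component, as in the figure above.
largest_cc = max(nx.connected_components(network), key=len)
subnetwork = network.subgraph(largest_cc)

# Partition the component into putative functional modules.
communities = community.greedy_modularity_communities(subnetwork)
print(len(communities))   # number of detected communities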

This brings us to a whole new problem. What is a “similar biological function” and how do you measure it? This question cannot be perfectly answered, but it seems the Gene Ontology annotations for biological process are a good place to start. In this framework, proteins are annotated with terms which describe the biological process they participate in. Of course there is not always a consensus about what term is to be assigned to a protein, and it is questionable how precisely a protein’s function within a process can be determined, but it wouldn’t be called work, if it was easy.

In my 10 weeks here, I’ve only scratched the surface of detecting functional communities in protein interaction networks, but it looks promising that the communities obtained may have some significance with regard to biological modules. It is my hope that I can use data sets such as gene expression studies to further investigate this significance in the future, and maybe, if I’m very lucky, work towards helping people classify macrophage phenotypes or identify cancer in the distant future. The best place to do this would definitely be in the friendly atmosphere that is OPIG!

[Database] SAbDab – the Structural Antibody Database

An increasing proportion of our research at OPIG is about the structure and function of antibodies. Compared to other types of proteins, there is a large number of antibody structures publicly available in the PDB (approximately 1.8% of structures contain an antibody chain). For those of us working in the fields of antibody structure prediction, antibody-antigen docking and structure-based methods for therapeutic antibody design, this is great news!

However, we find that these data are not in a standard format with respect to antibody nomenclature. For instance, which chains are “heavy” chains and which are “light”? Which heavy and light chains pair? Is there an antigen present? If so, to which H-L pair does it bind? Which numbering system is used… etc.

To address this problem, we have developed SAbDab: the Structural Antibody Database. Its primary aim is for easy creation of antibody structure and antibody-antigen complex datasets for further analysis by researchers such as ourselves. These sets can be selected using a number of criteria (e.g. experimental method, species, presence of constant domains…) and redundancy filters can be applied over the sequences of both the antibody and antigen. Thanks to Jin, SAbDab now also includes associated curated affinity (Kd) values for around 190 antibody-antigen complexes. We hope this will serve as a benchmarking tool for antibody-antigen docking prediction algorithms.

[Image: the SAbDab web interface]

Alternatively, the database can be used to inspect and compare properties of individual structures. For instance, we have recently published a method to characterise the orientation between the two antibody variable domains, VH and VL. Using the ABangle tool, users can select structures with a particular VH-VL orientation, visualise and quantify conformational changes (e.g. between bound and unbound forms) and inspect the pose of structures with certain amino acids at specific positions. Similarly, the CDR (complementarity determining region) search and clustering tools allow the antibody hyper-variable loops to be selected by length, type and canonical class, and their structures to be visualised or downloaded.

[Image: the SAbDab structure viewer]

 

SAbDab also contains features such as the template search. This allows a user to submit the sequence of either an antibody heavy or light chain (or both) and to find structures in the database that may offer good templates to use in a homology modelling protocol. Specific regions of the antibody can be isolated so that structures with a high sequence identity over, for example, the CDR H3 loop can be found. SAbDab’s weekly automatic updates ensure that it contains the latest available data. Using each method of selection, the structure, a standardised and re-numbered version of the structure, and a summary file containing information about the antibody can be downloaded either individually or en masse as a dataset. SAbDab will continue to develop with new tools and features and is freely available at: opig.stats.ox.ac.uk/webapps/sabdab.

Research Talk: High Resolution Antibody Modelling

In keeping with the other posts in recent weeks, and providing a certain continuity, this post also focusses on antibodies. For those of you that have read the last few weeks’ posts, you may wish to skip the first few paragraphs, otherwise things may get repetitive…

Antibodies are key components of the immune system, with almost limitless potential variability. This means that the immune system is capable of producing antibodies with the ability to bind to almost any target. Antibodies exhibit very high specificity and very high affinity towards their targets, and this makes them excellent at their job – of marking their targets (antigens) to identify them to the rest of the immune system, either for modification or destruction.

Immunoglobulin G (IgG) Structure

(left) The Immunoglobulin (IgG) fold, the most common fold for antibodies. It is formed of four chains, two heavy and two light. The binding regions of the antibody are at the ends of the variable domains VH and VL, located at the ends of the heavy and light chains respectively. (right) The VH domain. At the end of both the VH and the VL domains are three hypervariable loops (CDRs) that account for most of the structural variability of the binding site. The CDRs are highlighted in red. The rest of the domain (coloured in cyan), that is not the CDRs, is known as the framework.

Over the past few years, the use of antibodies as therapeutic agents has increased. It is now at the point where we are beginning to computationally design antibodies to bind to specific targets. Whether they are designed to target cancer cells or viruses, the task of designing the CDRs to complement the antigen perfectly is a very difficult one. Computationally, the best way of predicting the affinity of an antibody for an antigen is through the use of docking programs.

For best results, high-resolution and very accurate models of both the antibody and the antigen are needed. This is because small changes in the antibody’s sequence have been shown experimentally to produce large changes in the affinity.

Many antibody modelling protocols currently exist, including WAM, PIGS, and RosettaAntibody. These use a variety of approaches. WAM and PIGS use homology modelling approaches to model the framework, augmented with expert knowledge-based rules to model the CDRs. RosettaAntibody also uses homology modelling to model the framework of the antibody, but then uses the Rosetta protocol to perform an exploration of the conformational space to find the lowest energy conformation.

However, there are several problems that remain. The orientation between the VH domain and the VL domain has been shown to be instrumental in the high binding affinity of the antibody. Mutations to framework residues that change the orientation of the VH and VL domains have been shown to cause significant changes to the binding affinity.

Because of the multi-chain modelling problem, which currently has no general solution, the current approach is often to copy the orientation across from the template antibody to create the orientation of the target antibody. (The three examples above do perform some degree of orientation optimisation using conserved residues at the VH-VL interface.)

However, before we begin to consider how to effect the modelling of the VH-VL interface, we must first build the VH and the VL separately. All of the domain folds in the IgG structure are very similar, consisting of two anti-parallel beta sheets sandwiched together. These beta sheets are very well conserved. The VH domain is harder to model because it contains the CDR H3 – which is the longest and most structurally variable of the 6 CDRs – so we may as well start there…


Framework structural alignment of 605 non-redundant VHs (made non-redundant @95% sequence identity). The beta sheet cores are very well conserved, but the loops exhibit more structural variability (although not that much by general protein standards…). The stumps where the CDRs have been removed are labelled.

But even before we start modelling the VH, how hard is the homology modelling problem likely to be for the average VH sequence that we come across? Extracting all of the VH sequences from the IMGT database (72,482 sequences), we find, for each sequence, the structure in SAbDab (the Structural Antibody Database) that exhibits the highest sequence identity. This is the structure that would generally be used as the template for modelling. Results below…

[Figure: distribution of best-template sequence identity for the IMGT VH sequences]

 

Most of the sequences have a best template with over 70% sequence identity, so modelling them with low RMSDs (< 1 Angstrom) should be possible. However, there are still those that have lower sequence identity. These could be problematic…
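For concreteness, the best-template search itself boils down to something like the following sketch (a naive global-alignment identity using Biopython’s pairwise2; the sequence sources, numbering and the real search are more involved, and the dictionary of SAbDab sequences is an assumption for illustration):

from Bio import pairwise2

def sequence_identity(seq_a, seq_b):
    """Percentage identity from a simple global alignment
    (match = 1, mismatch = 0, no gap penalties)."""
    aln_a, aln_b, score, _, _ = pairwise2.align.globalxx(seq_a, seq_b)[0]
    return 100.0 * score / len(aln_a)

def best_template(query_vh, sabdab_vh_sequences):
    """Return the (pdb_code, identity) of the closest SAbDab VH sequence;
    `sabdab_vh_sequences` is assumed to be a dict of {pdb_code: sequence}."""
    scored = [(code, sequence_identity(query_vh, seq))
              for code, seq in sabdab_vh_sequences.items()]
    return max(scored, key=lambda pair: pair[1])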

When we are analysing the accuracy of our models, we often generate models for which we have experimentally derived crystal structures, and then compare them. But a crystal structure is not necessarily the native conformation of the protein, and some of the solvents added to aid the crystallisation could well distort the structure in some small (or possibly large) way. Or perhaps the protein is just flexible, and so we wouldn’t expect it to adopt just one conformation.

Again using SAbDab to help generate our datasets, we found the maximum variation (backbone RMSD) between sequence-identical VH domains, for the framework region only. How different can 100% identical sequences get? Again, results are below…

[Figure: maximum framework backbone RMSD between sequence-identical VH domains]

We see that even for 100% identical domains, the conformations can differ enough to give a significant RMSD. The 1.4 Å RMSD difference (PDB entries 4fqc and 4fq1) is due to a completely different conformation for one of the framework loops.
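For reference, the backbone RMSD after optimal superposition can be computed with the standard Kabsch procedure, roughly as in this sketch (it assumes the two framework regions have already been matched atom-by-atom into equal-length coordinate arrays):

import numpy as np

def backbone_rmsd(coords_a, coords_b):
    """RMSD between two (n x 3) arrays of matched backbone atoms after
    optimal superposition (Kabsch algorithm)."""
    a = coords_a - coords_a.mean(axis=0)      # centre both coordinate sets
    b = coords_b - coords_b.mean(axis=0)
    u, _, vt = np.linalg.svd(np.dot(b.T, a))  # optimal rotation of b onto a
    d = np.sign(np.linalg.det(np.dot(u, vt))) # avoid an improper rotation (reflection)
    rotation = np.dot(np.dot(u, np.diag([1.0, 1.0, d])), vt)
    diff = a - np.dot(b, rotation)
    return np.sqrt((diff ** 2).sum() / len(a))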

So, although antibody modelling is easy in some respects – high conservation, large number of available structures for templates – it is not just a matter of getting it ‘close’, or even ‘good’. It’s about getting it as near to perfect as possible… (even though perfect may be ~ 0.4 A RMSD over the framework…)

Watch this space…

“Perfection is not attainable, but if we chase perfection we can catch excellence.”

(Vince Lombardi )