Monthly Archives: July 2013

Every Protein needs a Friend – Community Detection in Protein Interaction Networks

To make the OPIG soup, that has tasted of antibodies a lot lately, a little more diverse, I will try to spice things up with a dash of protein interaction networks, a pinch of community detection and a shot of functional similarity evaluation. I hope it remains edible!


In the 10 weeks I have spent at OPIG, my main focus has been on protein interaction networks, or more specifically, on this network:

View of the largest connected component of the HINT binary physical interaction network. Nodes represent proteins and edges are protein interactions.

Viewing this image, a popular German phrase comes to mind, which, badly translated, means: “As you see, you see nothing”. However, “seeing” something in this is exactly what I’ve been trying to do. And as it turns out, I’m not the only one.

If we had a data set which says exactly which protein interacts with which other ones, then surely all biological pathway information must be incorporated in this data, and we should be able to cluster it into smaller modules or communities, which represent a biological function. This Gedankenexperiment is the theory which underlies my approach to these networks.

In reality, however, we don’t have this perfect data set. Protein interaction networks are very noisy with high estimated false positive and false negative rates for interactions, yet community detection algorithms have still been shown to be successful in outputting meaningful partitions of the network into communities. In this context “meaningful” refers to communities which group proteins together that have a similar biological function.
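To make “partitioning a network into communities” a little more concrete, here is a minimal sketch of one common way to score a partition: Newman-Girvan modularity. This is only an illustration, not necessarily the measure or algorithm used in this project, and the toy graph (two triangles joined by a single edge) and its node names are made up.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity Q for an undirected, unweighted graph.

    edges        -- list of (node, node) pairs, one per interaction
    community_of -- dict mapping each node to a community label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal_edges = defaultdict(int)  # edges with both ends in one community
    degree_sum = defaultdict(int)      # summed node degrees per community
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy network (hypothetical): two triangles joined by the edge C-D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"),
             ("D", "E"), ("E", "F"), ("D", "F"),
             ("C", "D")]
two_communities = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_community = {node: 0 for node in "ABCDEF"}

q_split = modularity(toy_edges, two_communities)
q_single = modularity(toy_edges, one_community)
print(q_split, q_single)
```

Splitting the toy network into its two triangles scores higher than lumping everything together, which is the sort of signal a community detection algorithm tries to maximise.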

This brings us to a whole new problem. What is a “similar biological function” and how do you measure it? This question cannot be perfectly answered, but it seems the Gene Ontology annotations for biological process are a good place to start. In this framework, proteins are annotated with terms which describe the biological process they participate in. Of course there is not always a consensus about what term is to be assigned to a protein, and it is questionable how precisely a protein’s function within a process can be determined, but it wouldn’t be called work, if it was easy.
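As a rough illustration of what “measuring functional similarity” can look like, the sketch below scores a community by the average Jaccard overlap of its members’ annotation sets. Jaccard overlap is a deliberately simple stand-in (measures that use the structure of the ontology also exist), and the protein names and term labels are placeholders rather than real Gene Ontology annotations.

```python
from itertools import combinations


def jaccard(terms_a, terms_b):
    """Shared annotation terms divided by all terms seen in either set."""
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def mean_pairwise_similarity(proteins, annotations):
    """Average Jaccard similarity over all protein pairs in a community."""
    scores = [jaccard(annotations[p], annotations[q])
              for p, q in combinations(proteins, 2)]
    return sum(scores) / len(scores) if scores else 0.0


# Placeholder annotations (hypothetical, not real GO terms).
annotations = {
    "protein_1": {"process_a", "process_b"},
    "protein_2": {"process_b", "process_c"},
    "protein_3": {"process_d"},
}

tight = mean_pairwise_similarity(["protein_1", "protein_2"], annotations)
loose = mean_pairwise_similarity(["protein_1", "protein_2", "protein_3"],
                                 annotations)
print(tight, loose)
```

A community whose members share annotations scores higher than one padded with an unrelated protein.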

In my 10 weeks here, I’ve only scratched the surface of detecting functional communities in protein interaction networks, but it looks promising that the communities obtained may have some significance regarding biological modules. It is my hope that I can use data sets such as gene expression studies to further investigate this significance in the future, and maybe, if I’m very lucky, work towards helping people classify macrophage phenotypes or identify cancer in the distant future. The best place to do this would definitely be in the friendly atmosphere that is OPIG!

[Database] SAbDab – the Structural Antibody Database

An increasing proportion of our research at OPIG is about the structure and function of antibodies. Compared to other types of proteins, there is a large number of antibody structures publicly available in the PDB (approximately 1.8% of structures contain an antibody chain). For those of us working in the fields of antibody structure prediction, antibody-antigen docking and structure-based methods for therapeutic antibody design, this is great news!

However, we find that these data are not in a standard format with respect to antibody nomenclature. For instance, which chains are “heavy” chains and which are “light”? Which heavy and light chains pair? Is there an antigen present? If so, to which H-L pair does it bind? Which numbering system is used? And so on.

To address this problem, we have developed SAbDab: the Structural Antibody Database. Its primary aim is to make it easy to create antibody structure and antibody-antigen complex datasets for further analysis by researchers such as ourselves. These sets can be selected using a number of criteria (e.g. experimental method, species, presence of constant domains…) and redundancy filters can be applied over the sequences of both the antibody and antigen. Thanks to Jin, SAbDab now also includes associated curated affinity (Kd) values for around 190 antibody-antigen complexes. We hope this will serve as a benchmarking tool for antibody-antigen docking prediction algorithms.
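For readers unfamiliar with redundancy filtering, here is a minimal sketch of the general idea: walk through the entries and keep one only if it is less than a chosen percentage identical to everything already kept. This is not SAbDab’s implementation. It assumes the sequences are already aligned to equal length (with `-` for gaps), and the entry names and ten-residue sequences are made up.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical residues over columns where neither has a gap.

    Assumes the two sequences are already aligned to the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    return 100.0 * sum(a == b for a, b in compared) / len(compared)


def filter_redundant(entries, cutoff):
    """Greedy filter: drop an entry if it is >= cutoff % identical to a kept one."""
    kept = []
    for name, seq in entries:
        if all(percent_identity(seq, kept_seq) < cutoff for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


# Hypothetical aligned fragments, for illustration only.
entries = [
    ("ab1", "QVQLVESGGG"),
    ("ab2", "QVQLVESGGA"),  # 90% identical to ab1
    ("ab3", "EIVLTQSPAT"),  # 20% identical to ab1
]
kept_names = [name for name, _ in filter_redundant(entries, cutoff=90.0)]
print(kept_names)
```

Note that a greedy filter like this depends on the order of the entries, so in practice you would sort them first (for example by resolution) to control which representative survives.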


Alternatively, the database can be used to inspect and compare properties of individual structures. For instance, we have recently published a method to characterise the orientation between the two antibody variable domains, VH and VL. Using the ABangle tool, users can select structures with a particular VH-VL orientation, visualise and quantify conformational changes (e.g. between bound and unbound forms) and inspect the pose of structures with certain amino acids at specific positions. Similarly, the CDR (complementarity determining region) search and clustering tools allow the antibody hyper-variable loops to be selected by length, type and canonical class, and their structures to be visualised or downloaded.



SAbDab also contains features such as the template search. This allows a user to submit the sequence of either an antibody heavy or light chain (or both) and to find structures in the database that may offer good templates to use in a homology modelling protocol. Specific regions of the antibody can be isolated so that structures with a high sequence identity over, for example, the CDR H3 loop can be found. SAbDab’s weekly automatic updates ensure that it contains the latest available data. Using each method of selection, the structure, a standardised and re-numbered version of the structure, and a summary file containing information about the antibody can be downloaded either individually or en masse as a dataset. SAbDab will continue to develop with new tools and features and is freely available at:

Research Talk: High Resolution Antibody Modelling

In keeping with the other posts in recent weeks, and providing a certain continuity, this post also focusses on antibodies. For those of you that have read the last few weeks’ posts, you may wish to skip the first few paragraphs, otherwise things may get repetitive…

Antibodies are key components of the immune system, with almost limitless potential variability. This means that the immune system is capable of producing antibodies with the ability to bind to almost any target. Antibodies exhibit very high specificity and very high affinity towards their targets, and this makes them excellent at their job – of marking their targets (antigens) to identify them to the rest of the immune system, either for modification or destruction.

Immunoglobulin G (IgG) Structure

(left) The Immunoglobulin (IgG) fold, the most common fold for antibodies. It is formed of four chains, two heavy and two light. The binding regions of the antibody are at the ends of the variable domains VH and VL, located at the ends of the heavy and light chains respectively. (right) The VH domain. At the end of both the VH and the VL domains are three hypervariable loops (CDRs) that account for most of the structural variability of the binding site. The CDRs are highlighted in red. The rest of the domain (coloured in cyan), that is not the CDRs, is known as the framework.

Over the past few years, the use of antibodies as therapeutic agents has increased. It is now at the point where we are beginning to computationally design antibodies to bind to specific targets. Whether they are designed to target cancer cells or viruses, the task of designing the CDRs to complement the antigen perfectly is a very difficult one. Computationally, the best way of predicting the affinity of an antibody for an antigen is through the use of docking programs.

For best results, high-resolution, very accurate models of both the antibody and the antigen are needed. This is because small changes in the antibody’s sequence have been seen experimentally to produce large changes in affinity.

Many antibody modelling protocols currently exist, including WAM, PIGS, and RosettaAntibody. These use a variety of approaches. WAM and PIGS use homology modelling approaches to model the framework, augmented with expert knowledge-based rules to model the CDRs. RosettaAntibody also uses homology modelling to model the framework of the antibody, but then uses the Rosetta protocol to perform an exploration of the conformational space to find the lowest energy conformation.

However, there are several problems that remain. The orientation between the VH domain and the VL domain is shown to be instrumental in the high binding affinity of the antibody. Mutations to framework residues that change the orientation of the VH and VL domains have been shown to cause significant changes to the binding affinity.

Because of the multi-chain modelling problem, which currently has no general solution, the current approach is often to copy the orientation across from the template antibody to create the orientation of the target antibody. (The three examples above do perform some degree of orientation optimisation using conserved residues at the VH-VL interface.)

However, before we begin to consider how to effect the modelling of the VH-VL interface, we must first build the VH and the VL separately. All of the domain folds in the IgG structure are very similar, consisting of two anti-parallel beta sheets sandwiched together. These beta sheets are very well conserved. The VH domain is harder to model because it contains the CDR H3 – which is the longest and most structurally variable of the 6 CDRs – so we may as well start there…

Framework structural alignment of 605 non-redundant VHs (made non-redundant @95% sequence identity). The beta sheet cores are very well conserved, but the loops exhibit more structural variability (although not that much by general protein standards…). The stumps where the CDRs have been removed are labelled.

But even before we start modelling the VH, how hard is the homology modelling problem likely to be for the average VH sequence that we come across? Extracting all of the VH sequences from the IMGT database (72,482 sequences) we find the structure in SAbDab (Structural Antibody Database) that exhibits the highest sequence identity to each of the sequences. This is the structure that would generally be used as the template for modelling. Results below…
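The template search described above can be sketched roughly as follows: for each query sequence, pick the template with the highest sequence identity, then ask what fraction of queries have a best template above some threshold. This is only an illustration under strong assumptions. The sequences are taken to be pre-aligned to equal length, identity is a simple per-column match count, and all names and sequences are invented.

```python
def percent_identity(query, template):
    """Percentage of matching columns; assumes equal-length, pre-aligned strings."""
    if len(query) != len(template):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(q == t for q, t in zip(query, template))
    return 100.0 * matches / len(query)


def best_template(query, templates):
    """Return (template_name, identity) for the most similar template."""
    return max(
        ((name, percent_identity(query, seq)) for name, seq in templates.items()),
        key=lambda hit: hit[1],
    )


# Hypothetical data standing in for database structures and query sequences.
templates = {"tmpl_1": "QVQLVESGGG", "tmpl_2": "EVQLLESGGG"}
queries = {"query_1": "QVQLVESGGA", "query_2": "DIQMTQSPSS"}

hits = {name: best_template(seq, templates) for name, seq in queries.items()}
fraction_over_70 = sum(ident > 70.0 for _, ident in hits.values()) / len(hits)
print(hits, fraction_over_70)
```

With real data you would align each query to each template properly (or use a numbering scheme) before counting matches; the equal-length shortcut is just to keep the example short.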



Most of the sequences have a best template with over 70% sequence identity, so modelling them with low RMSDs (< 1 Angstrom) should be possible. However, there are still those that have lower sequence identity. These could be problematic…

When we are analysing the accuracy of our models, we often generate models for which we have experimentally derived crystal structures, and then compare them. But a crystal structure is not necessarily the native conformation of the protein, and some of the solvents added to aid the crystallisation could well distort the structure in some small (or possibly large) way. Or perhaps the protein is just flexible, and so we wouldn’t expect it to adopt just one conformation.

Again using SAbDab to help generate our datasets, we found the maximum variation (backbone RMSD) between sequence-identical VH domains, for the framework region only. How different can 100% identical sequences get? Again, results are below…


We see that even for 100% identical domains, the conformations can differ enough to give a significant RMSD. The 1.4 Å RMSD between PDB entries 4fqc and 4fq1 is due to a completely different conformation of one of the framework loops.
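For reference, the RMSD comparison itself is simple once the structures are superimposed. The sketch below computes the RMSD between matched coordinate lists and takes the maximum over all pairs of conformations. It assumes the coordinates are already superimposed and listed in the same atom order, and the coordinates themselves are toy values rather than real backbone atoms.

```python
import math
from itertools import combinations


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def max_pairwise_rmsd(conformations):
    """Largest RMSD between any two conformations of the same sequence."""
    return max(rmsd(a, b) for a, b in combinations(conformations, 2))


# Toy coordinates (hypothetical): the second conformation moves one atom by 3 units.
conf_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
conf_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
conf_3 = list(conf_1)

worst = max_pairwise_rmsd([conf_1, conf_2, conf_3])
print(worst)
```

A single displaced atom (or loop) is enough to dominate the value, which is why one differently folded framework loop can produce a large RMSD between otherwise identical domains.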

So, although antibody modelling is easy in some respects – high conservation, large number of available structures for templates – it is not just a matter of getting it ‘close’, or even ‘good’. It’s about getting it as near to perfect as possible… (even though perfect may be ~0.4 Å RMSD over the framework…)

Watch this space…

“Perfection is not attainable, but if we chase perfection we can catch excellence.”

(Vince Lombardi)

Citing R packages in your Thesis/Paper/Assignments

If you need to cite R, there is a very useful function called citation().

> citation()

To cite R in publications use:

  R Core Team (2013). R: A language and environment for statistical
  computing. R Foundation for Statistical Computing, Vienna, Austria.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {R: A Language and Environment for Statistical Computing},
    author = {{R Core Team}},
    organization = {R Foundation for Statistical Computing},
    address = {Vienna, Austria},
    year = {2013},
    url = {},
  }

We have invested a lot of time and effort in creating R, please cite it
when using it for data analysis. See also ‘citation("pkgname")’ for
citing R packages.

If you want to cite just a package, just pass the package name as a parameter, e.g.:

> citation(package = "cluster")

To cite the R package 'cluster' in publications use:

  Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik,
  K.(2013).  cluster: Cluster Analysis Basics and Extensions. R package
  version 1.14.4.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {cluster: Cluster Analysis Basics and Extensions},
    author = {Martin Maechler and Peter Rousseeuw and Anja Struyf and Mia Hubert and Kurt Hornik},
    year = {2013},
    note = {R package version 1.14.4 --- For new features, see the 'Changelog' file (in the package source)},
  }

This will give you BibTeX entries you can copy and paste into your BibTeX file. If you are using M$ Word, good luck to you.

[Publication] Cloud computing in Molecular Modelling – a topical perspective


My ex-InhibOx colleagues (Simone Fulle, Garrett Morris, Paul Finn) and I have recently published a topical review on “The emerging role of cloud computing in molecular modelling” in the Journal of Molecular Graphics and Modelling. The paper starts with a gentle but in-depth introduction to the field of cloud computing. The second part covers how it applies to molecular modelling (and the sort of tasks we can run in the cloud). The third and last part presents two practical case studies of cloud computations, one of which describes how we built a virtual library for virtual screening on AWS.

We hope that after reading this article the cloud will become a less nebulous affair! *pun intended*

As an addendum, I recently came across this paper “Teaching cloud computing: A software engineering perspective” (2013) on how to teach cloud computing at a graduate level.  This work is relevant, because lots of universities are presently including cloud computing in their curricula.


Making Protein-Protein Interfaces Look (decently) Good

This is a little PyMOL script that I’ve used to draw antibody-antigen interfaces. If you’d like a version with comments on what each and every line does, contact me! It is a slight modification of a script from the PyMOL Wiki.

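# Rename the loaded object (replace FILENAME with your object's name)
set_name FILENAME, complex

# White background, everything coloured white and drawn as cartoon
set bg_rgb, [1,1,1]
color white
hide lines
show cartoon

# Define the two binding partners (adjust the chain IDs to your structure)
select antibody, chain a
select antigen, chain b

# Interface atoms: atoms within 4.5 Angstroms of the other partner
select paratopeAtoms, antibody within 4.5 of antigen
select epitopeAtoms, antigen within 4.5 of antibody

# Expand the atom selections to whole residues
select paratopeRes, byres paratopeAtoms
select epitopeRes, byres epitopeAtoms

# Draw red dashes between interface atoms up to 4.5 Angstroms apart
distance interactions, paratopeAtoms, epitopeAtoms, 4.5, 0
color red, interactions
hide labels, interactions

# Show the interface residues as sticks
show sticks, paratopeRes
show sticks, epitopeRes
set cartoon_side_chain_helper, on

# Show the interface atoms as small spheres, coloured by partner
set sphere_quality, 2
set sphere_scale, 0.3
show spheres, paratopeAtoms
show spheres, epitopeAtoms
color tv_blue, paratopeAtoms
color tv_yellow, epitopeAtoms

# Rendering settings for the ray-traced image
set ray_trace_mode, 3
unset depth_cue
set specular, 0.5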

Once you orient it to where you’d like it and ray it, you should get something like this.
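If you are curious what the `within 4.5 of` and `byres` selections in the script amount to, here is a rough plain-Python sketch of the same idea: a brute-force distance cutoff between two sets of atoms, followed by collecting the residues those atoms belong to. It is not how PyMOL implements selections, and the atoms (residue number plus coordinates) are invented.

```python
import math


def atoms_within(atoms_a, atoms_b, cutoff=4.5):
    """Atoms from atoms_a that lie within `cutoff` of any atom in atoms_b.

    Each atom is a (residue_id, (x, y, z)) tuple. Brute force, O(len_a * len_b).
    """
    return [
        atom for atom in atoms_a
        if any(math.dist(atom[1], other[1]) <= cutoff for other in atoms_b)
    ]


def residues_of(atoms):
    """Set of residue ids the given atoms belong to (the 'byres' step)."""
    return {residue_id for residue_id, _ in atoms}


# Hypothetical atoms: (residue number, coordinates in Angstroms).
antibody = [(101, (0.0, 0.0, 0.0)), (102, (10.0, 0.0, 0.0))]
antigen = [(5, (3.0, 0.0, 0.0)), (6, (20.0, 0.0, 0.0))]

paratope_atoms = atoms_within(antibody, antigen)
epitope_atoms = atoms_within(antigen, antibody)
paratope_residues = residues_of(paratope_atoms)
epitope_residues = residues_of(epitope_atoms)
print(paratope_residues, epitope_residues)
```

For whole structures a spatial grid or k-d tree would be much faster than comparing every pair of atoms, but the result is the same.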

Building an Antibody Benchmark Set

In this so-called ‘big data’ age, the quest to find the signal amidst the noise is becoming more difficult than ever. Though we have sophisticated systems that can extract and parse data incredibly efficiently, the amount of noise has grown just as much, if not more, masking the signals that we crave. Oddly enough, it sometimes seems that we are churning through and gathering vast amounts of data just for the sake of it, rather than looking for highly relevant, high-quality data.

One such example is antibody (Ab) binding data. Even though there are several Ab-specific databases (e.g. abYsis, IMGT), none of these, to our knowledge, has any information on an Ab’s binding affinity to its antigen (Ag), despite the fact that an Ab’s affinity is one of the few quantitative metrics of its performance. Therefore, gathering Ab binding data would not only help us to create more accurate models of Ab binding, it would, in the long term, facilitate the in silico maturation and design/re-design of Abs. If this seems like a dream, have a read of this paper – they made an incredibly effective Ab from computationally-inspired methods.

Given the tools at our disposal, and the fact that several protein-protein binding databases are available in the public domain, this task may seem somewhat trivial. However, there’s the ever-present issue of gathering only the highest quality data points in order to perform some of the applications mentioned earlier.

Over the past few weeks, we have gathered the binding data for 228 Ab-Ag complexes across two major protein-protein binding databases: PDBbind and the structure-based benchmark from Kastritis et al. Ultimately, 36 entries were removed from further analyses as they had irrelevant data (e.g. IC50 instead of KD; IC50 relates to inhibition, which is not the same as the Ab’s affinity for its Ag). With the remaining dataset, we performed some initial tests on existing energy functions and docking programs to see if there is any correlation between the programs’ scores and protein binding affinities.
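When comparing scores against affinities, it is often convenient to put KD values on an energy scale first, using the standard relation ΔG = RT ln(KD), with KD in molar units relative to a 1 M standard state. The sketch below shows that conversion. Whether this exact transformation was used for the plots here is an assumption on my part, and the function name and the default temperature of 298.15 K are illustrative choices.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol * K)


def binding_free_energy(kd_molar, temperature_kelvin=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in molar.

    Uses dG = R * T * ln(Kd / 1 M); tighter binders give more negative values.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return GAS_CONSTANT_KCAL * temperature_kelvin * math.log(kd_molar)


dg_nanomolar = binding_free_energy(1e-9)   # a 1 nM binder
dg_micromolar = binding_free_energy(1e-6)  # a 1 uM binder
print(round(dg_nanomolar, 2), round(dg_micromolar, 2))
```

An IC50 cannot simply be dropped into this formula, which is one reason those entries had to be set aside.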

Blue = Abs binding to proteins, Red = Abs binding to peptides

As the graphs show, there is no clear correlation between a program/function’s score and the affinity of an Ab. Having said this, these programs were trained on general protein-protein interfaces (though that does occasionally include Abs!), so we retrained DCOMPLEX and RAPDF specifically on Ab structures (~130 structures). The end results were poor nonetheless (top-centre and top-right graphs, above), but the interatomic heatmaps show clear differences in the interaction patterns between Ab-Ag interfaces and general protein-protein interfaces.
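For anyone wanting to repeat this kind of check on their own data, the correlation in question is the Pearson coefficient between program scores and measured affinities. Here is a minimal, dependency-free sketch; the score and affinity values are made-up numbers chosen to give one perfectly correlated and one uncorrelated case, not values from our benchmark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariance / (spread_x * spread_y)


# Hypothetical docking scores and affinities, for illustration only.
scores = [1.0, 2.0, 3.0, 4.0]
affinities_correlated = [2.0, 4.0, 6.0, 8.0]
affinities_uncorrelated = [1.0, -1.0, -1.0, 1.0]

r_good = pearson(scores, affinities_correlated)
r_none = pearson(scores, affinities_uncorrelated)
print(r_good, r_none)
```

Our plots look far more like the second case than the first.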

Interatomic contact map between Ab-Ag or two general proteins. Warmer colours represent higher counts.

Now, with this new information, the search for signals continues. It is evident that Ab binding differs distinctly from general protein-protein binding. Therefore, the next step is to gather more high-quality data and see if there is any correlation between an Ab’s distinct binding mode and its affinity. However, we are not interested in just getting whatever affinity data is available. The rigorous standards we have used over the past few weeks for building the current benchmark set must be maintained – otherwise we risk masking the signal with unnecessary noise.

Currently, the results are disappointing, but if the past few weeks in OPIG have taught me anything, it is that this is only the beginning of a long and difficult search for a good model. BUT – this is what makes research so exciting! We learn from the low Pearson correlation coefficients, the (almost) random distribution of data, and the not-so-pretty plots of our data in order to form useful models for practical applications like Ab design. I think a quote from The Great Gatsby accurately ‘models’ my optimism for making sense of the incoming stream of data:

Gatsby believed in the green light, the orgastic future that year by year recedes before us. It eluded us then, but that’s no matter — to-morrow we will run faster, stretch out our arms farther. . . . And one fine morning ——

So we beat on, boats against the current, borne back ceaselessly into the past.