Monthly Archives: January 2013

Journal club: Antibody-protein docking using asymmetric statistical potential

The second group presentation covered the antibody-antigen docking paper by Brenke et al. 2012. Before moving to the methodology and results, one has to provide a bit of motivation and fundamentals.

Why Antibodies?

Antibodies are one of the most basic lines of defense of the vertebrate organism. They constitute a class of proteins whose structure allows them to adjust their binding profile (affinity and specificity) to bind virtually any protein. The source of this adjustability is the Complementarity Determining Regions (CDRs), which typically consist of six hypervariable loops. Owing to this binding malleability, they are among the most important biomarkers and biopharmaceuticals. NB: in most cases when people talk about antibodies they mean IgG, the iconic Y-shaped molecule (other classes of antibodies exist, and some are assemblies of several IgG-like units, e.g. IgM is a pentamer thereof).

Why Docking?

The number of protein structures in the Protein Data Bank (PDB) is ever increasing. By analyzing these structures we can gain indispensable insights into the working of a living organism through the proxy of protein-protein interactions. The only caveat is that the number of crystallized complexes lags far behind the number of complexes that could be formed even from the single-protein structures in the PDB alone. This gave rise to the field of protein-protein docking, exemplified by tools like ZDOCK, HADDOCK, RosettaDock, ClusPro, PatchDock and many more. These methods receive two unbound proteins as input and attempt to generate a complex between them that approximates the native complex formed in the organism. The available methods are still far from being able to generate reliable complexes, although there appears to be progress, as demonstrated by successive rounds of the CAPRI experiment.

Why Docking Antibodies?

There are two main classes of docking problems: enzymes and antibodies. Enzymes are considered an easier target for the algorithm because of their shape complementarity and suitable hydrophobic pockets. Antibodies, on the other hand, bind proteins on flat surfaces. Furthermore, enzymes undergo correlated mutations with their binding partners over long periods of time, whereas antibodies are adjusted towards their binding partner sometimes in a matter of days. Thus, Brenke et al. developed a docking method tailored specifically to the problem of antibody-antigen binding.


The algorithm developed by Brenke et al. is the only currently available global rigid-body antibody protein docking algorithm. There is another method that also explicitly addresses antibodies, SnugDock, but it rather contributes high-resolution, local, flexible docking capability. Other methods like ZDOCK or PatchDock identify the binding site on the antibody and mask all the other residues that are unlikely to be interacting (although one has to keep in mind that only about 80% of the binding residues are found in the CDRs as defined by either Kabat, Chothia, Abnum or IMGT).

The ADARS algorithm follows from the earlier method, DARS. The main feature of the method is the novel way of calculating the reference state in the statistical potential equation, which forms a component of the algorithm's energy function.

$$\varepsilon_{IJ} = -kT \ln \frac{p_{obs}(I,J)}{p_{ref}(I,J)} \qquad (1)$$

The function ε defines a potential between two atom types I and J (for instance, the hydroxyl oxygen of tyrosine and one of the phenyl ring carbons of phenylalanine). A negative value of the expression in (1) stands for attraction and a positive value for repulsion. The constant k is the Boltzmann constant and T the temperature. The expression pobs approximates the probability of seeing atoms I and J at an interaction distance (defined as less than 6 Å by Brenke et al.), whereas pref denotes the reference state for the interaction of those two atom types. In general, pref approximates the background distribution of seeing the two atom types together. Thus, if pobs is greater than pref, the given pair of atoms appears within contact distance more often than expected, meaning that there is a tendency to pair them up. Calculating a suitable value for the reference state is crucial for the success of the method.
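As a rough illustration of the sign convention in (1), here is a minimal Python sketch. The function name `pair_potential`, the choice of units (kcal/mol, with the Boltzmann constant expressed per mole) and the example probabilities are all assumptions made for this illustration; none of them come from the paper.

```python
import math

# Boltzmann constant per mole (i.e. the gas constant) in kcal/(mol*K).
# The choice of units is an assumption of this sketch.
K_BOLTZMANN = 0.0019872041


def pair_potential(p_obs, p_ref, temperature=300.0):
    """Return -kT * ln(p_obs / p_ref) for a single atom-type pair."""
    if p_obs <= 0.0 or p_ref <= 0.0:
        raise ValueError("probabilities must be positive")
    return -K_BOLTZMANN * temperature * math.log(p_obs / p_ref)


# Toy probabilities, chosen only to show the sign of the result.
attractive = pair_potential(p_obs=0.02, p_ref=0.01)   # seen more often than expected
repulsive = pair_potential(p_obs=0.005, p_ref=0.01)   # seen less often than expected
neutral = pair_potential(p_obs=0.01, p_ref=0.01)      # seen exactly as often as expected
```

A pair observed twice as often as the reference predicts gets a negative (attractive) value, and a pair observed half as often gets a positive value of the same magnitude.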

The main contribution of DARS is the way the reference state is computed. Given a training set of protein-protein complexes, a subset of those is selected, binding partners separated and re-docked. Over a set of decoys returned for each target, the number of times a given atom pair (I,J) was observed at an interaction distance is noted. Those frequencies are denoted νIJ and are used in equation (2) to compute the value of pref for a given pair of atoms.

$$P_{ref}(I,J) = \frac{\nu_{IJ}}{\sum_{K,L} \nu_{KL}} \qquad (2)$$

Here pref stands for the same pref as in equation (1) (even though capitalized). In the case of ADARS, the value of pref is calculated using a subset of the training set of antibody-antigen complexes in order to customize the method towards those molecules.
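The decoy-counting step can be sketched in a few lines of Python. This assumes that the reference probability is simply the count for a pair divided by the total count over all pairs, which is my reading of equation (2) rather than something checked against the paper. The function name, the atom-type labels and the input layout (one list of contacting atom-type pairs per decoy) are hypothetical.

```python
from collections import Counter


def reference_probabilities(decoy_contacts):
    """Estimate p_ref(I, J) from contacts observed in docking decoys.

    decoy_contacts: iterable of decoys, each a list of (type_i, type_j)
    tuples for atom pairs found within the interaction distance.
    """
    counts = Counter()
    for contacts in decoy_contacts:
        for type_i, type_j in contacts:
            # Symmetric treatment: (I, J) and (J, I) share one counter.
            counts[tuple(sorted((type_i, type_j)))] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no contacts observed in any decoy")
    return {pair: n / total for pair, n in counts.items()}


# Two toy decoys with made-up atom-type labels.
toy_decoys = [
    [("O_tyr", "C_phe"), ("N_lys", "O_asp")],
    [("C_phe", "O_tyr")],
]
p_ref = reference_probabilities(toy_decoys)
```

In this toy input the tyrosine-phenylalanine pair is seen twice out of three contacts, so its reference probability is 2/3.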

Another feature of the method is the symmetry condition. In the original DARS, it was assumed that atomic potentials are symmetric, as given in (3):

$$\varepsilon_{IJ} = \varepsilon_{JI} \qquad (3)$$

Brenke et al. note a considerable disparity in atom pairing preferences for antibodies and antigens, leading them to introduce directionality into the knowledge-based potential function by additionally specifying the molecule class the atom came from (i.e. either antibody or antigen).
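One way to picture the directionality is to keep the atom-type pair ordered, with the antibody atom always first. The sketch below is only an illustration under that assumption; the function name, the atom-type labels and the toy contacts are hypothetical and not taken from the paper.

```python
from collections import Counter


def asymmetric_frequencies(contacts):
    """Relative frequencies of ordered (antibody_type, antigen_type) pairs.

    contacts: iterable of (antibody_atom_type, antigen_atom_type) tuples.
    The order is kept, so (I, J) and (J, I) are counted separately.
    """
    counts = Counter((ab_type, ag_type) for ab_type, ag_type in contacts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no contacts given")
    return {pair: n / total for pair, n in counts.items()}


# Toy contacts: tyrosine oxygen on the antibody side pairs with lysine
# nitrogen on the antigen side twice, and the reverse arrangement once.
toy_contacts = [("O_tyr", "N_lys"), ("O_tyr", "N_lys"), ("N_lys", "O_tyr")]
freq = asymmetric_frequencies(toy_contacts)
```

Because the two orderings end up with different frequencies, a potential derived from them would no longer satisfy the symmetry condition in (3).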


The authors used a subset of the targets from the Protein-Protein Docking Benchmark 3.0 to test the performance of their asymmetric potential. They tested three algorithms on this benchmark:

  • DARS: the original version of the potential for protein-protein docking
  • aDARS: the potential calculated using the antibody-antigen training set although still with the symmetry condition given in (3)
  • aADARS: the potential calculated using the antibody-antigen training set with the symmetry condition in (3) removed.

The authors use the interface RMSD (Irmsd) metric to assess the quality of docking (for details see the CAPRI criteria). A successful decoy has an Irmsd of less than 10 Å. According to this metric, the decoys returned by aDARS are of better quality than those of the original DARS. Furthermore, the quality of aADARS is better than that of aDARS, meaning that the asymmetry condition, which is the only distinguishing factor here, contributes to the predictive power.
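The success criterion can be sketched as a simple hit count. The Irmsd values below are made up purely to exercise the code and are not results from the paper; the function names and the neutral method labels are likewise hypothetical.

```python
def count_hits(irmsd_values, threshold=10.0):
    """Count decoys whose interface RMSD (in angstroms) is below the threshold."""
    return sum(1 for value in irmsd_values if value < threshold)


def compare_methods(results, threshold=10.0):
    """Map each method name to its number of successful decoys."""
    return {name: count_hits(values, threshold) for name, values in results.items()}


# Made-up Irmsd values for two unnamed scoring functions.
toy_results = {
    "method_a": [12.3, 9.1, 15.0, 22.4],
    "method_b": [8.7, 9.9, 14.2, 10.0],
}
hits = compare_methods(toy_results)
```

Note that a decoy at exactly 10 Å does not count as a hit here, since the criterion is "less than 10 Å".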


Brenke et al. have developed a docking method customized for the problem of antibodies. Their algorithm provides approximate solutions to the global rigid-body docking problem. As such, its answers are good enough to serve as starting points for higher-resolution methods such as SnugDock, which can refine the initial pose provided by ADARS.

Anatomy of a blog post

Now, shall I use the first person singular or plural to write this?  Active or passive voice? …

It doesn’t really matter.  This isn’t a formal article, and you can even use abbreviations.  This group blog, like anything else during our time in Oxford, is an experiment.  We will give it a few months and see what happens.  If it pans out, we will have a, more or less, detailed research journal for the group.  Not to mention a link with the outside world (prospective students? employers?) and proof that we can “communicate” with others. And since this is an exploratory exercise we should have freedom to explore what we want to write about.

We should have plenty of fodder.  Let us face it, if we do not do some mildly interesting science every week then we are probably not having enough fun.  But even if you are working on a hushed up, undercover project (e.g. the next blockbuster drug against Malaria) – there are still so many interesting bits of our D.Phil. which would otherwise never see the light of day.

For inspiration, have a look at other popular scientific blogs; the ChEMBL one is both educational and humorous in equal measure (Post Idea #1: list of bio/cheminformatics blogs which every grad student should read). Blogs are a great way to survey literature without actually doing any reading (Post Idea #2: tricks to increase grad student productivity… what do you mean you don’t use Google Alerts to surprise your supervisor with a link to a paper published the day before?); and for a TL;DR version there is Twitter (Post Idea #3: Idea #1 but for Twitter instead). I only found out about the four-stranded DNA in human cells by following @biomol_info.

And of course, we are mostly a computational group, so software is what we churn out on a daily basis. How much of the software we write ends up resting forever on our disks, never to be used again? The masses want splitchain! (Idea #4: post software you wrote). And there is benefit in not only giving out software, but also explaining the internals with snippets (Idea #5: a clever algorithm explained line-by-line).

And then there is the poster you hung up once (Idea #6) or the talk you gave and prepared for hours on your disposable, use-once-only slides (Idea #7). There is the announcement of publishing a paper, that solemn moment in academia when someone else thinks what you have done is worthy (Idea #8; btw well done to our own Jamie Hill for his recent MP-T work).

And if you’re an athlete, like Anna (Dr. Lewis) who crossed the Atlantic in a rowing boat or Eleanor who used to row for the Blues (what can I say, this is how we roll, or row [feeble attempt at humour]), that’s a non-scientific but unique and interesting experience too (Idea #9).

If you’ve read a paper and you think it’s interesting, comment on it; people will follow your posts just because it acts like a literature filter (Idea #10). You can probably even have a rant (Idea #11), as long as it’s more positive and less bitter than Fred Ross’ Farewell to Bioinformatics.

Finally, this post is long and tedious for the reader.  But that is ok too – like everything else here, it is a learning experience and the more I write the more I will improve.  So hey, I’m also doing this to write a better thesis (i.e. to make the writing less painful).

An addendum; my initial intention was to discuss the bits which make a good blog post.  You can find lots of articles about this – so it is less interesting; but here are the main points


If a picture is really worth a thousand words, 30 of these is all I need for my thesis.


Journal club: Principles for designing ideal protein structures

The goal of protein design is to generate a sequence that assumes a certain structure and/or performs a specific function. A recent paper in Nature has attempted to design sequences for each of five naturally occurring protein folds. The success rate ranges from 10% to 40%.

This recent work comes from the Baker group, who are best known for Rosetta and have made several previous steps in this direction. In a 2003 paper this group stripped several naturally occurring proteins down to the backbone, and then generated sequences whose side-chains were consistent with these backbone structures. The sequences were expressed and found to fold into proteins, but the structure of these proteins remained undetermined. Later that same year the group designed a protein, Top7, with a novel fold and confirmed that its structure closely matched that of the design (RMSD of 1.2 Å).

The proteins designed in these three pieces of work (the current paper and the two papers from 2003) all tend to be more stable than naturally occurring proteins. This increased stability may explain why, as with the earlier Top7, the final structures in the current work closely match the design (RMSD of 1 or 2 Å), despite ab initio structure prediction rarely being this accurate. These structures are designed to sit in a deep potential well in the Rosetta energy function, whereas natural proteins presumably have more complicated energy landscapes that allow for conformational changes and easy degradation. Designing a protein with two or more conformations is a challenge for the future.

In the current work, several sequences were designed for each of the fold types. These sequences have substantial sequence similarity to each other, but do not match existing protein families. The five folds all belong to the alpha + beta or alpha/beta SCOP classes. This is a pragmatic choice: all-alpha proteins often fold into undesired alternative topologies, and all-beta proteins are prone to aggregation. By contrast, rules such as the right-handedness of beta-alpha-beta turns have been known since the 1970s, and can be used to help design a fold.

The authors describe several other rules that influence the packing of beta-alpha-beta, beta-beta-alpha and alpha-beta-beta structural elements. These relate the lengths of the elements and their connective loops with the handedness of the resulting subunit. The rules and their derivations are impressive, but it is not clear to what extent they are applied in the design of the 5 folds. The designed folds contain 13 beta-alpha-beta subunits, but only 2 alpha-beta-beta subunits, and 1 beta-beta-alpha subunit.

An impressive feature of the current work is the use of the Rosetta@home project to select sequences with funnelled energy landscapes, which are less likely to misfold. Each candidate sequence was folded more than 200,000 times from an extended chain. Only ~10% of sequences had a funnelled landscape. It would have been interesting to validate whether the rejected sequences really were less likely to adopt the desired fold, especially given that this selection procedure requires vast computational resources.
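To make "funnelled" a little more concrete, here is one simple way such a filter could be written. This is not the criterion used in the paper: the function name, the idea of inspecting only the lowest-energy fraction of folding runs, and every threshold below are assumptions chosen for illustration.

```python
def is_funnelled(samples, low_energy_fraction=0.1, rmsd_cutoff=2.0,
                 min_near_fraction=0.8):
    """Crude funnel test on folding runs for one candidate sequence.

    samples: list of (rmsd_to_design, energy) tuples, one per folding run.
    The landscape is called funnelled if most of the lowest-energy runs
    end up close to the designed structure.
    """
    if not samples:
        raise ValueError("no folding runs given")
    ranked = sorted(samples, key=lambda sample: sample[1])  # lowest energy first
    n_low = max(1, int(len(ranked) * low_energy_fraction))
    lowest = ranked[:n_low]
    near = sum(1 for rmsd, _ in lowest if rmsd <= rmsd_cutoff)
    return near / n_low >= min_near_fraction


# Toy landscapes: in the first, energy drops as the runs approach the design;
# in the second, the lowest energies belong to structures far from it.
funnel_like = [(0.5 + 0.1 * i, -100.0 + 5.0 * i) for i in range(20)]
misfolding = [(10.0 - 0.4 * i, -100.0 + 5.0 * i) for i in range(20)]

funnel_result = is_funnelled(funnel_like)
misfold_result = is_funnelled(misfolding)
```

With hundreds of thousands of runs per sequence, a filter of this kind is cheap compared with the folding simulations that produce its input.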

The design of these five novel proteins is a great achievement, but even greater challenges remain. The present designs are facilitated by the use of short loops in regions connecting secondary structure elements. Functional proteins will probably require longer loops, more marginal stabilities, and a greater variety of secondary structure subunits.

Talk: Membrane Protein 3D Structure Prediction & Loop Modelling in X-ray Crystallography

Seb gave a talk at the Oxford Structural Genomics Consortium on Wednesday 9 Jan 2013. The talk mentioned the work of several other OPIG members. Below is the gist of it.

Membrane protein modelling pipeline

Homology modelling pipeline with several membrane-protein-specific steps. Input is the target protein’s sequence, output is the finished 3D model.

Fragment-based loop modelling pipeline for X-ray crystallography

Given an incomplete model of a protein, as well as the current electron density map, we apply our loop modelling method FREAD to fill in a gap with many decoy structures. These decoys are then scored using electron density quality measures computed by EDSTATS. This process can be iterated to arrive at a complete model.
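The iteration described above can be sketched as a loop that keeps trying the open gaps until a round makes no progress. This is only a schematic: `generate_decoys` and `score_decoy` are hypothetical stand-ins for a loop-modelling step and a density-based scoring step, not the real FREAD or EDSTATS interfaces, and the model is reduced to a plain dictionary.

```python
def build_iteratively(model, gaps, generate_decoys, score_decoy,
                      min_score, max_rounds=5):
    """Fill gaps in a model, keeping only decoys that score well enough.

    model: dict mapping region name to built fragment (toy representation).
    generate_decoys(model, gap) returns candidate fragments for a gap.
    score_decoy(decoy) returns a quality score (higher is better).
    Returns the extended model and the list of gaps left unbuilt.
    """
    built = dict(model)
    remaining = list(gaps)
    for _ in range(max_rounds):
        still_open = []
        for gap in remaining:
            decoys = generate_decoys(built, gap)
            best = max(decoys, key=score_decoy, default=None)
            if best is not None and score_decoy(best) >= min_score:
                built[gap] = best
            else:
                still_open.append(gap)
        made_progress = len(still_open) < len(remaining)
        remaining = still_open
        if not remaining or not made_progress:
            break
    return built, remaining


# Toy stand-ins: "loop2" only gets a good decoy once "loop1" has been built.
def toy_generate(built, gap):
    if gap == "loop2" and "loop1" not in built:
        return ["poor"]
    return ["poor", "good"]


def toy_score(decoy):
    return {"poor": 0.2, "good": 0.9}[decoy]


final_model, unbuilt = build_iteratively(
    {}, ["loop2", "loop1"], toy_generate, toy_score, min_score=0.5
)
```

In the toy run the second loop fails in the first round and succeeds in the second, which is the kind of dependency that makes iterating worthwhile.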

Over the past five years the Oxford Protein Informatics Group has produced several pieces of software to model various aspects of membrane protein structure. iMembrane predicts how a given protein structure sits in the lipid bilayer. MP-T aligns a target protein’s sequence to an iMembrane-annotated template structure. MEDELLER produces an accurate core model of the target, based on this target-template alignment. FREAD then fills in the remaining gaps through fragment-based loop modelling. We have assembled all these pieces of software into a single pipeline, which will be released to the public shortly. In the future, further refinements will be added to account for errors in the core model, such as helix kinks and twists.

X-ray crystallography is the most prevalent way to obtain a protein’s 3D structure. In difficult cases, such as membrane proteins, often only low resolution data can be obtained from such experiments, making the subsequent computational steps to arrive at a complete 3D model that much harder. This usually involves tedious manual building of individual residues and much trial and error. In addition, some regions of the protein (such as disordered loops) simply are not represented by the electron density at all and it is difficult to distinguish these from areas that simply require a lot of work to build. To alleviate some of these problems, we are developing a scoring scheme to attach an absolute quality measure to each residue being built by our loop modelling method FREAD, with a view towards automating protein structure solution at low resolution. This work is being carried out in collaboration with Frank von Delft’s Protein Crystallography group at the Oxford Structural Genomics Consortium.