
Finding Communities in Networks: Link Clustering and the Affiliation Graph Model

The research I do revolves around how to break down protein interaction networks (PINs) into functional modules, or communities, using community detection methods. These communities are groups of proteins which interact more with one another than with the rest of the network. We decompose PINs in this manner because it is very difficult to determine how a protein functions through its interactions by simply looking at the entire network. The sheer amount of data in one of these networks clouds the information we would actually like to see, as explained in my previous blog post.

In a journal club in August, I presented the overlapping community detection method Link Clustering by Ahn et al., which is the first community detection method I am aware of that assigns nodes to communities by clustering the edges between them. I contrasted Link Clustering with the Affiliation Graph Model method proposed by Yang and Leskovec, as the authors use the same comparison technique to validate their methods. In a world where studies usually only work on a single network using a specific method, having two independent papers that use the same validation method is something that gets me much more excited than it probably should.


Link Clustering

Instead of assigning nodes to communities, Link Clustering focusses on the edges in a network. Clustering edges is essentially an equivalent problem to clustering nodes. This becomes obvious if you take a graph, decide to call all the edges “blobs” and say two of your new “blobs” have an edge between them if they share a node in the original network. This is called creating the “line graph” of the network and performing “blob” clustering on this is the same as performing edge clustering on the initial network.
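
For the curious, here is a minimal sketch of the line-graph construction using the networkx Python library (my own illustration, not part of the original paper):

import networkx as nx

# Build a small network, then its line graph: each edge of G becomes a
# node ("blob") of L, and two blobs are linked if the original edges
# share an endpoint.
G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4)])
L = nx.line_graph(G)

print(list(L.nodes()))  # the edges of G
print(list(L.edges()))  # pairs of G-edges sharing a node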

To come back from my tangent, if an edge is assigned to community A, this means the nodes on either side of this edge are assigned community A by association. By assigning each edge to an individual community, Link Clustering creates overlapping node communities as shown in Figure 1 below.


Figure 1: Example of a simple, link-clustered network. Nodes can be seen to be in multiple communities as edges are assigned to communities (c.f. node 4 in red, green and yellow communities). [Ahn et al., 2010]

In Link Clustering, edges are placed into the same community based on similarity scores computed between all connected edges. Edge clusters are built from these similarity scores using single-linkage clustering, where the pairs of edges with the highest similarity scores are iteratively merged into the same community. Under this procedure every edge eventually ends up in a single community, so a threshold for the similarity score needs to be found at which the network is partitioned into communities in the best way. This threshold acts as a kind of resolution parameter for the community detection. At a low threshold there are few, large communities, as many edges have been clustered together, while at a high threshold only highly similar edges share a community and there are many, small communities. Ahn et al. propose maximising a function called the partition density to find the optimal resolution threshold. This function is simply a weighted average of the density of the communities. For those incredibly keen, it is given by the equation:

D = \frac{2}{M} \sum_c m_c \frac{m_c - (n_c -1)}{(n_c - 2)(n_c -1)}

Here, M represents the total number of edges in the network, and m_c and n_c represent the numbers of edges and nodes in community c, respectively.
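
To make the formula concrete, here is a minimal Python sketch of the calculation (my own illustration; `communities` is an assumed dict mapping a community label to its set of edges):

def partition_density(communities):
    # M: total number of edges across all communities
    M = sum(len(edges) for edges in communities.values())
    total = 0.0
    for edges in communities.values():
        m_c = len(edges)                          # edges in community c
        n_c = len({n for e in edges for n in e})  # nodes in community c
        if n_c > 2:  # a community of 2 nodes (one lone edge) contributes zero
            total += m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))
    return 2.0 / M * total

# Two disjoint triangles are fully dense, so D = 1.0
print(partition_density({"A": {(1, 2), (2, 3), (1, 3)},
                         "B": {(4, 5), (5, 6), (4, 6)}}))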

All of this partitioning is based on a similarity score between two edges. The crux of this method, the scoring measure, is explained in Figure 2.


Figure 2: Schematic diagram showing how the similarity between edges e_{ik} and e_{jk} is calculated in Link Clustering. The overlap of the inclusive neighbour sets of nodes i and j is divided by the union of these sets. The inclusive neighbour set of node i is computed by taking the neighbours of node i and including node i itself. The nodes in grey are shared between both sets, therefore the similarity between edges e_{ik} and e_{jk} is 4/12 or 1/3. [Ahn et al., 2010]
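
In code, this similarity is just the Jaccard index of the inclusive neighbour sets of the two nodes not shared by the edges (a sketch of mine, with `adj` an assumed dict mapping each node to its set of neighbours):

def edge_similarity(adj, i, j):
    # Inclusive neighbour sets: the node's neighbours plus the node itself
    n_i = adj[i] | {i}
    n_j = adj[j] | {j}
    return len(n_i & n_j) / len(n_i | n_j)

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(edge_similarity(adj, 1, 2))  # edges e_13 and e_23 share node 3 -> 1.0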

Affiliation Graph Model

The method which I would like to compare Link Clustering to is the Affiliation Graph Model (AGM). Briefly, this is a statistical community detection algorithm which generates overlapping communities. Node affiliations are recorded in the Community Affiliation Matrix shown in Figure 3.


Figure 3: Schematic of the Community Affiliation Matrix, which shows nodes (coloured squares) being affiliated with communities (circles A, B, and C) as shown by the arrows. Nodes can have multiple community affiliations as indicated by red and purple node squares. The probability of interaction between two nodes which are both in only community A is p_A. [Yang and Leskovec, 2012]

The AGM has a probability of interaction for each community, denoted p_A for community A. For nodes u and v, which may share several community affiliations, the probability of interaction, p(u,v) is given by the equation:

p(u,v) = 1 - \prod_{k \in C_{uv}} (1 - p_k),

where k denotes a community in C_{uv}, the set of communities shared by nodes u and v. This model is then fitted to the network using a Metropolis-Hastings algorithm, with a Markov chain constructed by pre-defined steps in the space of possible Community Affiliation Matrices (c.f. Yang and Leskovec 2012).
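
As a minimal sketch (mine, not the authors' code), the edge probability can be computed directly from the community affiliations, here represented as a dict of membership sets:

def p_edge(u, v, memberships, p):
    # C_uv: the communities shared by u and v
    shared = memberships[u] & memberships[v]
    prob_no_edge = 1.0
    for k in shared:
        prob_no_edge *= 1.0 - p[k]  # no edge arises via community k
    return 1.0 - prob_no_edge

memberships = {"u": {"A", "B"}, "v": {"A", "B", "C"}}
p = {"A": 0.3, "B": 0.5, "C": 0.9}
print(p_edge("u", "v", memberships, p))  # 1 - 0.7*0.5 = 0.65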

Result Comparison Method

Link Clustering itself has some very interesting results, especially relating to looking at real-world networks at different resolutions (these can be found in the paper). However, I want to focus on the validation method used in the paper, which compares the results generated by Link Clustering with those of other methods. This comparison evaluates four different aspects of the partitions generated by different methods: Community Quality, Overlap Quality, Community Coverage, and Overlap Coverage. The “Quality” aspects use network metadata to indicate how good the obtained communities are, and the “Coverage” aspects relate to the amount of information extracted from a network.

The metadata referred to above is a general term for additional data available about the nodes in a network. In the case of PINs this metadata relates to the function of proteins in the network (their Gene Ontology annotations). Loosely, proteins with similar functions should be placed in the same community for a partition to score well on the Community Quality measure. Overlap Quality concerns the boundaries between generated communities: if the boundaries between groups of functions in the metadata mirror the boundaries between the generated communities, the Overlap Quality is high.

The “Coverage” values are more basic calculations. A partition has a high Community Coverage if the fraction of nodes assigned to communities with three or more nodes is high (non-trivial communities). A community of size two is uninformative, as it is the result of a single edge being in a community by itself. Overlap Coverage is simply the number of non-trivial community memberships per node. The two “Coverage” values are thus equal for non-overlapping community detection methods.

When comparing community detection methods, these four measures are computed for a partition generated by each method, and then rescaled so that the maximum score achieved by any method is 1 for each measure independently. These values are then added to give a score between 0 and 4 for each community detection method.
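
A small sketch of this rescale-and-sum scheme (my own illustration, with made-up numbers):

def composite_scores(raw):
    # raw maps method -> [community quality, overlap quality,
    #                     community coverage, overlap coverage]
    best = [max(vals[i] for vals in raw.values()) for i in range(4)]
    return {method: sum(v / b for v, b in zip(vals, best))
            for method, vals in raw.items()}

raw = {"LinkClustering": [0.8, 0.4, 0.9, 1.2],
       "AGM":            [1.0, 0.5, 0.6, 1.0]}
print(composite_scores(raw))  # each method scores between 0 and 4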


Results

The results shown in Figure 4 were generated by Yang and Leskovec to show that their AGM outperforms Link Clustering (and other methods) on a variety of networks. While it is notable that they used the same metric and networks as Ahn et al. did when proposing Link Clustering, this figure must be taken with a pinch of salt. Yang and Leskovec acknowledge that the metric proposed by Ahn et al. rewards methods which find more communities in a network; they therefore do not fit the number of communities the AGM itself judges to be best, but instead fit the same number as found by Link Clustering. Furthermore, it is peculiar that half of the networks used for comparison by Ahn et al. have disappeared from this comparison.


Figure 4: Community detection method comparison for Link Clustering (L), Clique Percolation (C), Mixed-Membership Stochastic Block Model (M) and the Affiliation Graph Model (A) on different PINs (PPI) and other social networks. Methods are compared by the metric proposed by Ahn et al. [Yang & Leskovec 2012]

To conclude, while it is very interesting that two papers use the same method to compare their approaches with others for validation purposes, the comparison was not carried out entirely rigorously here. The comparison metric is nevertheless an interesting development towards a gold standard for method comparison. For my purposes Community Quality is the most important aspect, so a version of the score weighted towards it may be more useful.


ECCB 2014 (Strasbourg)

The European Conference on Computational Biology was held in Strasbourg, France this year. The conference was attended by several members of OPIG including Charlotte Deane, Waqar Ali, Eleanor Law, Jaroslaw Nowak and Cristian Regep. Waqar gave a talk on his paper proposing a new distance measure for network comparison (Netdis). There were many interesting talks and posters at the conference, and brief summaries of the ones we found most relevant are given below.

The impact of incomplete knowledge on the evaluation of protein function prediction: a structured output learning perspective.

Authors: Yuxiang Jiang, Wyatt T. Clark, Iddo Friedberg and Predrag Radivojac

Chosen by: Waqar Ali

The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy.

In the paper, the authors studied the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, they provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. They also analyzed real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. They find a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, their results also suggest that current large-scale evaluations are meaningful and surprisingly reliable. The choice of function prediction methods evaluated by the authors is not exhaustive, however, and it is quite possible that other methods might be much more sensitive to incomplete annotations.

Towards practical, high-capacity, low-maintenance information storage in synthesized DNA.

Authors: Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos & Ewan Birney

Chosen by: Jaroslaw Nowak

This was one of the keynote speaker talks. One of the authors, Ewan Birney, discussed the viability of storing digital information in DNA. The paper describes storing 739 kilobytes of hard-disk data in synthesized DNA. The data stored included all 154 of Shakespeare’s sonnets in ASCII text, a scientific paper in PDF format, a medium-resolution colour photograph of the European Bioinformatics Institute in JPEG format, a 26-second excerpt from Martin Luther King’s 1963 ‘I have a dream’ speech in MP3 format, and the Huffman code used in the study to convert bytes to base-3 digits (ASCII text). The authors accomplished this by first converting the binary data into base-3 numbers using Huffman coding (which represents the most common pieces of information using the fewest digits). The base-3 numbers were then converted into a nucleotide sequence in such a way that no homopolymers were produced. The authors also showed that they could read the encoded information back and reconstruct the data with 100% accuracy.
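
The homopolymer-avoiding step is neat enough to sketch (my own toy version; the actual encoding tables in the paper differ): each base-3 digit selects one of the three nucleotides that differ from the previously written base, so no base can ever repeat.

def trits_to_dna(trits, prev="A"):
    out = []
    for t in trits:
        # Only the three bases different from the previous one are allowed
        choices = [b for b in "ACGT" if b != prev]
        prev = choices[t]
        out.append(prev)
    return "".join(out)

print(trits_to_dna([0, 2, 1, 0]))  # "CTCA" -- homopolymer-free by construction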

Using DNA for information storage could very soon become cost-effective for sub-50-year archiving. The current costs of the process are $12,400/MB for writing and $220/MB for reading information, with negligible costs of copying. Nevertheless, if current trends persist, we could see a 100-fold drop in costs in less than a decade. The main advantages of storing information in DNA are the low maintenance and durability of the medium (intact DNA fragments have been recovered from samples that are tens of thousands of years old), as well as the little physical space required to store the information (~2.2 PB/g).


PconsFold: Improved contact predictions improve protein models.

Authors: Michel M, Hayat S, Skwark MJ, Sander C, Marks DS and Elofsson A.

Chosen by: Eleanor Law

De novo structure prediction by fragment assembly is a very difficult task, but can be aided by contact prediction in cases where there is plenty of sequence data. Contact prediction has also significantly improved recently, using statistical methods to separate direct from indirect contact information.

PSICOV and plmDCA are two such methods, providing contacts which can be used by software such as Rosetta as an additional energy term. PconsFold combines 16 different sets of contact predictions from these programs, built from different sequence alignments, with secondary structure and solvent accessibility prediction. The output of the deep learning process on these inputs is more reliable than the individual contact predictions alone, and produces more accurate models. The authors found that using only 2,000 decoys rather than 20,000 did not greatly harm their results, which is encouraging, as decoy generation is the particularly resource-intensive stage. Balancing the Rosetta energy function against the weight of the contact predictions, the optimal number of constraints to include was around 2 per residue, compared to 0.5 per residue for PSICOV or plmDCA alone.

The PconsFold pipeline is not always able to make full use of the contact prediction, as accuracy of contact prediction on the true structure can be higher than that in the model. This is a case where the conformational search is not effective enough to reach the correct answer, though it would be scored correctly if it were obtained. All-beta proteins are the most difficult to predict, but PconsFold compares favourably to EVfold-PLM for each of the mainly alpha, mainly beta, and alpha & beta classes.


Modeling of Protein Interaction Networks

In the group meeting on the 20th of August, I presented the paper by Vazquez et al (2002). This was one of the first papers proposing the duplication divergence model of evolution for protein interaction networks, and thus has had a significant impact on the field, inspiring many variants of the basic model. The paper starts out by noting that the yeast protein network has the ‘small world’ property – following along links in the network, it requires only a handful of steps to go from any one protein to any other. Another property is the manner in which the links are shared out among the various proteins: empirically, the probability that a protein interacts with k other proteins follows a power-law distribution.

Vazquez et al. show how evolution can produce scale-free networks. They explore a model for the evolution of protein networks that accurately reproduces the topological features seen in the yeast S. cerevisiae. As the authors point out, proteins fall into families according to similarities in their amino-acid sequences and functions, and it is natural to suppose that such proteins have all evolved from a common ancestor. A favoured hypothesis views such evolution as taking place through a sequence of gene duplications – a relatively frequent occurrence during cell reproduction. Following each duplication, the two resulting genes are initially identical and naturally lead to the production of identical proteins. But between duplications, random genetic mutations lead to a slow divergence of the genes and their associated proteins. This repetitive, two-stage process can be captured in a relatively simple model for the growth of a protein interaction network.
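
A minimal sketch of such a duplication-divergence growth process (parameter conventions vary between papers; here q is the probability of retaining each copied link and p the probability of linking the duplicate to its parent, loosely in the spirit of the model):

import random
import networkx as nx

def duplication_divergence(n, p=0.1, q=0.7, seed=None):
    rng = random.Random(seed)
    G = nx.Graph([(0, 1)])  # seed network: two interacting proteins
    while G.number_of_nodes() < n:
        parent = rng.choice(list(G.nodes))
        new = max(G.nodes) + 1
        for neigh in list(G.neighbors(parent)):
            if rng.random() < q:      # divergence: each copied link kept w.p. q
                G.add_edge(new, neigh)
        if rng.random() < p:          # possible link between duplicate and parent
            G.add_edge(new, parent)
    return G

G = duplication_divergence(1825, seed=42)
print(G.number_of_nodes(), G.number_of_edges())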

Simulations by the authors show that the model, starting out from a small seed network, captures the degree distribution of the yeast network with high fidelity (Fig. 1) and also possesses the high tolerance to random node removal seen in biological networks. While the results are more qualitative in nature, the model still serves as the basis of most biologically rooted explanations of protein network evolution, with minor improvements. Some of these additions have been the use of asymmetric divergence, whole-genome duplication events, as well as interaction site modelling. As the jury is still out on which model (if any) best fits current interaction data, the basic model remains relevant as a benchmark for newer models.


Fig. 1. Zipf plot for the PIN and the DD model with p = 0.1, q = 0.7 with N = 1,825. k is the connectivity of a node and r is its rank in decreasing order of k. Error bars represent standard deviation on a single network realization.

PLOS Comp Bio’s front page!

The work of our molecular dynamics expert, Bernhard Knapp, has made the front page of PLOS Computational Biology. The image shows how CDR3 loops scan a peptide/MHC complex (figure below: snapshot from the website). Immune cells in the human body screen other cells for possible infections. The binding of T-cell receptors (TCR) to parts of pathogens bound by major histocompatibility complexes (MHC) is one of the activation mechanisms of the immune system. There have been many hypotheses as to when such binding will activate the immune system. In this study we performed the, to our knowledge, largest set of Molecular Dynamics simulations of TCR-MHC complexes: 172 simulations, each of 100 ns in length. By performing a large number of simulations we gain insight into which structural features are frequently present in immune-system-activating and non-activating TCR-MHC complexes. We show that many previously suggested structural features are unlikely to be causal for the activation of the human immune system.


How CDR3 loops scan a peptide/MHC complex. Rendering is based on the 100 ns simulation of the wild-type TCRpMHC complex. Blue: peptide; Solid white cartoon and transparent surface: first frame of the MHC helices and β-sheet floor; Orange: equally distributed frames of the CDR3s over simulation time. Transparent white cartoon: first frame of all other TCR regions.

DSSP

My talk this week focused on secondary structure (SS) assignment. What do I mean by this? It is assigning SS types (principally α-helices and β-sheets) to protein structures. It can be found hiding in many of the things we do – e.g. alignment and modelling. There are many available methods to do this, of which DSSP (despite being published in 1983) is the most popular.


How does it work?

The algorithm identifies hydrogen bonds between mainchain carbonyl and amide groups. Partial charges are applied to the amide and carbonyl bonds, and the C, O, N, and H atoms are assumed to be point charges (hence C has charge +ρ1, O −ρ1, N −ρ2, and H +ρ2). The electrostatic energy between these four atoms is calculated, and if it is below −0.5 kcal/mol, a hydrogen bond is assigned. This is a relatively relaxed threshold, as a normal hydrogen bond in an α-helix is in the region of −3 kcal/mol, so it means that a given residue can have i+3, i+4, and i+5 hydrogen bonds simultaneously.
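
In code, the DSSP energy term looks roughly like this (a sketch from the published formula, with the partial charges 0.42e and 0.20e and the 332 factor converting to kcal/mol; distances in Ångströms):

def dssp_hbond_energy(r_ON, r_CH, r_OH, r_CN):
    # E = q1*q2*(1/r(ON) + 1/r(CH) - 1/r(OH) - 1/r(CN)) * 332 kcal/mol
    q1, q2, f = 0.42, 0.20, 332.0
    return q1 * q2 * f * (1 / r_ON + 1 / r_CH - 1 / r_OH - 1 / r_CN)

# Assign a hydrogen bond if the energy falls below the -0.5 kcal/mol cutoff
e = dssp_hbond_energy(r_ON=2.9, r_CH=3.5, r_OH=2.0, r_CN=3.9)
print(e, e < -0.5)  # roughly -3.5 kcal/mol: a typical helix hydrogen bond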

Helices and sheets are then identified where there are characteristic hydrogen bond patterns. For example, two consecutive i to i+4 backbone hydrogen bonds indicates an α-helix turn. The algorithm identifies each turn, and each β-bridge, and where several of these overlap, they are combined into single elements.

DSSP has 8 different SS assignments:

  • G – 3₁₀ helix
  • H – α-helix
  • I – π-helix
  • E – β-sheet
  • B – β-bridge
  • T – helix turn
  • S – bend (high curvature)
  • C – coil (none of the above)

These are assigned in an order of preference – HBEGITSC.
Many (but by no means all) SS assignment programs still use this notation.

DSSP is one of the simpler SS assignment programs. Its hydrogen bond energy calculation is distinctly simplistic: it does not (fully) take the angles of the hydrogen bond into account, and it provides only a binary classification for each hydrogen bond. However, perhaps surprisingly, DSSP is still the most used method. Why? Probably something to do with it being given away for free, which resulted in many software suites incorporating it (e.g. JOY, PROMOTIF). As a general rule, if something does not say what it uses for SS assignment, it probably uses DSSP.

Other Methods

Given the simplicity of DSSP, it is not surprising that there are a large number of other available methods. Indeed, you may notice that different programs will give different assignments (e.g. comparing Pymol to VMD or the PDB annotation).

There have been a vast number of other secondary structure (SS) annotation methods published, including: STRIDE, DEFINE, PROMOTIF, KAKSI, SST, PSSC, P-SEA, SECSTR, XLLSSTR, PALSSE, and STICK. The other two you are likely to come across are STRIDE, and the PDB annotation.

SS assignment in general

All of the SS assignment methods rely on a combination of three features of SS. These are:

  1. Mainchain hydrogen bonds
  2. Φ and Ψ angles
  3. Inter Cα distances

For all three of these, there are values characteristic of both helices and β-sheets. Every method takes some combination of these three features for each residue, and if they are within the chosen limits, classifies the residue accordingly. The limits are generally chosen to give good agreement with the PDB annotation.
(It should be noted that the hydrogen-bond containing methods use the position of the hydrogen atom, which is rarely present in the crystal structure, and thus must be inferred.)
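
As a toy illustration of feature 2 (and not any published method), a residue could be classified purely by whether its backbone dihedrals fall within crude Ramachandran windows; real methods combine all three features and tune the limits against the PDB annotation:

def toy_ss_assign(phi, psi):
    if -100 <= phi <= -30 and -80 <= psi <= -5:
        return "H"  # crude alpha-helical region of the Ramachandran plot
    if -180 <= phi <= -40 and 90 <= psi <= 180:
        return "E"  # crude beta-sheet region
    return "C"

print(toy_ss_assign(-60, -45), toy_ss_assign(-120, 130))  # H E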

STRIDE can be described as an updated version of DSSP. It has two main differences: it uses a much fuller description of hydrogen bond geometry, and it combines this with a knowledge-based φ/ψ angle potential to assign the residue state. It has many more parameters than DSSP, and these are trained on the PDB annotation. So where does that come from?

This PDB annotation comes from the depositors’ own annotation. The current guidance (from here) is to use the generated annotation from PROMOTIF. PROMOTIF uses DSSP, with a slight change: it annotates an extra residue at the end of each structure element. I am in no position to say how well this guidance is adhered to by depositors, or what their historical behaviour was, but the vast majority of annotations are reasonable.

I guess you are now wondering how different these methods are. Generally they agree in the obvious cases, and disagreement is normally at the ends of SS elements. Other examples (particularly pertinent to my research) occur when one method identifies a single long element, while another method identifies two elements separated by a coil section. Ultimately there is no ‘right’ answer, so saying one method is right and another wrong is impossible.

To sum up, DSSP is the de facto standard. Ignoring my previous comment, it is probably not the best algorithm, as it is a gross simplification. STRIDE improves on the algorithm (albeit using more parameters), whilst for specific tasks one method may be better than all of the others. It is hard to say if one is the best, and if it is important to you, then you should think about which method to use. If you do not think it is, then you should reconsider, and if it really is not important, then just use DSSP like everyone else. This is perhaps an example where the willing, free provision of your code to the community results in your method (DSSP) becoming the de facto standard.

Loopy LaTeX

Something that you may or may not know is that you can write loops in LaTeX. It is, in fact, a Turing-complete language, so you can write if statements too (which is helpful for making multi-line comments), and do all the other things you can do with a programming language.

So, I guess you are thinking, ‘eh, that’s cool, but why would I do it?’ Indeed, I have known about this for ages, but never found any need for it, until recently. My problem was that I had generated 80 small images, that I wanted to display on a number of pages. I could have played with print settings, and made multiple pages per sheet, but since I wanted two columns, and to fit 16 to a page (the 80 images were in 5 sets of 16, each with two subsets), that was going to be a pain. I also wanted to add some labels to each set, but not have said label on every image. However, I thought that LaTeX could solve my problem.

As ever, there are a number of different LaTeX packages that you can use; I used the pgffor package. In the example below, my pictures were named [A-E]_[ab][1-8]_picture, e.g. A_b2_picture.png or D_a3_picture.png. The code produces pages of 16 pictures, with the ‘a’ pictures on the left, and the ‘b’ pictures on the right.
There is a simpler example at the bottom.

\documentclass[a4paper,10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{pgffor}
\usepackage{subfigure} % note: subfigure clashes with the caption/subcaption packages, so those are not loaded
\usepackage{graphicx}
\usepackage{fullpage}

\begin{document}
\section{}

\foreach \method in {A, B, C, D, E}{ %looping over each set of 16 
  \begin{figure}
  \foreach \i in {1,...,8}{ %looping over the 8 rows; row \i shows the 'a' picture (left) and the 'b' picture (right)
    \centering     
    \subfigure[]{\label{fig:\method \i a}\includegraphics[width=0.45\textwidth]{\method _a\i _picture}} 
    \subfigure[]{\label{fig:\method \i b}\includegraphics[width=0.45\textwidth]{\method _b\i _picture}}
  }
  \caption{\method}
  \end{figure}
}

%a simpler example
\foreach \n in {1,...,10}{
\n \ elephant
}

\end{document}

Happy LaTeXing!

Computational Antibody Affinity Maturation

In this week’s journal club, we reviewed a paper by Lippow et al. in Nature Biotechnology, which features a computational pipeline capable of improving antibody (Ab) binding affinity by up to 140-fold. The paper discusses 4 test-case Abs (D44.1, cetuximab, 4-4-20, bevacizumab) and uses changes in electrostatic energy to identify favourable mutations. Up to the point when this paper was published back in 2007, computational antibody design was an (almost) unexplored field of research – except for a study by Clark et al. in 2006, no one had done anything like the work presented in this paper.

The idea behind the paper is to identify certain positions within the Ab structure for mutation and hopefully find an Ab with a higher binding affinity.


Pipeline

Briefly, the group generated mutant Ab-antigen (Ag) complexes using a series of algorithms (dead-end elimination and A*), which were then scored by the group’s energy function to identify favourable mutations. Lippow et al. used the electrostatics term of their binding affinity prediction to estimate the effects of mutations on an Ab’s binding affinity. In other words, instead of examining their entire scoring function, which includes terms such as the van der Waals energy, the group used only the change in the electrostatic energy term as an indicator for proposing mutations. Overall, in 2 of the 4 mentioned test cases (D44.1 & cetuximab) the proposed mutations were experimentally tested to validate the computational design pipeline – a brief overview of these two case studies is given below.

Results

In the case of the D44.1 anti-lysozyme Ab, the group proposed 9 single mutations by their electrostatics-based calculation method; 6/9 single mutants were confirmed to be beneficial (i.e., the mutant had an increased binding affinity). The beneficial single mutants were combined, ultimately leading to a quadruple mutant structure with a 100-fold improvement in affinity. The quadruple mutant was then subjected to a second round of computer-guided affinity maturation, leading to a new variant with six mutations (effectively a 140-fold improvement over the wild-type Ab). This case study was a solid testimony to the validity of their method; since anti-lysozyme Abs are often used as model systems, these results demonstrated that their design pipeline had taken, in principle, a suitable approach to maturing Abs in silico.

The second case study with cetuximab was arguably the more interesting result. Like the D44.1 case above, mutations were proposed to increase the Ab’s binding affinity on the basis of the changes in electrostatics. Although the newly-designed triple mutant only showed a 10-fold improvement over its wild-type counterpart, the group showed that their protocols can work for therapeutically-relevant Abs. The cetuximab example was a perfect complement to the previous case study — it demonstrated the practical implications of the method, and how this pipeline could potentially be used to mature existing Abs within the clinic today.

Effectively, the group suggested that mutations that introduce either hydrophobicity or a net charge at the binding interface tend to increase an Ab’s binding affinity. These conclusions shouldn’t come as a huge surprise, but it is remarkable that the group reached them with just one term from their energy function.

Conclusions

Effectively, the paper set off a whole new series of possibilities and helped us to widen our horizons. The paper was by no means perfect, especially with respect to predicting the precise binding affinities of mutants – much of this error can be boiled down to the modelling stage of their pipeline. However, the paper showed that computational affinity maturation is not just a dream – in fact, it’s perfectly doable and immediately applicable. Interestingly, Lippow et al.’s manipulation of an Ab’s electrostatics seems to be a valid approach, with recent publications on Ab maturation showing that introducing charged residues can enhance binding affinity (e.g. Kiyoshi et al., 2014).

More importantly, the paper was a beautiful showcase of how computational analyses could inform the decision making process in an in vitro framework, and I believe it exemplified how we should approach our problems in bioinformatics. We should not think of proteins as mere text files and numbers, but realise that they are living systems, and we’re not yet at a point where we fully understand how proteins behave. This shouldn’t discourage us from research; instead, it should give us the incentive to take things more slowly, and develop a method/product that could be used to solve greater, pragmatic problems.

Antibody modeling via AMA II and RosettaAntibody

Intro

Protein modeling is one of the most challenging problems in bioinformatics. We still lack a clear theoretical framework which would allow us to link a linear protein sequence to its native 3D coordinates. Given that we only have structures for roughly one in a thousand of the known sequences, homology modeling is still one of the most successful methods for obtaining a structure from a sequence. Currently, using homology modeling and the 1393 known folds, we can produce models for more than half of the known domains. In many cases this is good enough to get an overall idea of the fold, but for actual therapeutic applications there is still a need for high-resolution modeling.

There is one group of molecules whose properties can be readily exploited via computational approaches for therapeutic applications: antibodies. With blockbuster drugs such as Humira, Avastin or Remicade, they are the leading class of biopharmaceuticals. Antibodies share a great degree of similarity with one another (>50-60% sequence identity) and there are at least 1865 antibody structures in the PDB. Therefore, homology modeling of these structures at high resolution becomes tractable, as exemplified by WAM and PIGS. Here, we will review the antibody modeling paradigm using one of the most successful antibody modeling tools, RosettaAntibody, concluding with the most recent progress from AMA II (the antibody CASP).

General Antibody-antigen modeling

Modeling of antibody structures can be divided into the following steps:

  1. Identification of the Framework template
  2. Optimizing Vh/Vl orientation of the template
  3. Modeling of the non-H3 CDRs
  4. Modeling of H3

Most of the diversity of antibodies is found in the CDRs. Therefore, the bulk of the protein can be readily copied from the framework template. The template does, however, need to undergo an optimization of the Vh/Vl orientation. Prediction of the CDRs is more complicated, since they are much more variable than the rest of the protein. Non-H3 CDRs can be modeled using the canonical structure paradigm. Prediction of H3 is much more difficult, since it does not appear to follow canonical rules.

When the entire structure is assembled, it is recommended to perform refinement using some sort of relaxation of the structure, coupled with an energy function which should guide it.

RosettaAntibody

The RosettaAntibody protocol roughly follows the steps described above. In the first instance, an appropriate template is identified by the highest BLAST bit scores: the best heavy and light chains are aligned to the best-BLAST-scoring Fv region. The knowledge base here is a set of 569 antibody structures from SACS with resolutions of 3.5 Å or better. The Vh/Vl orientation is subsequently refined using local relaxation, guided by CHARMM.

Non-H3 CDRs are modeled using the highest-scoring BLAST hit of the same length. Canonical information is not taken into account. Loops are grafted on the framework using the residues overlapping with the anchors.

H3 loops are modeled using a fragment-based approach. The fragment library is Rosetta+H3, built from the knowledge base of antibody structures created for the purpose of this study. The low-resolution search consists of Monte Carlo attempts to fit 3-residue fragments, followed by Cyclic Coordinate Descent loop closure. This is followed by a high-resolution search in which the H3 loop and the Vh/Vl orientation are repacked using a variety of moves.

Each decoy coming from the repacking is scored using the Rosetta energy function; the lower the Rosetta score, the better the decoy (according to Rosetta).

Results

RosettaAntibody can produce high-quality models (1.4 Å) on its 54-structure benchmark test. The major limitation of the method (just like any other antibody modeling method) is H3 loop modeling. H3 is believed to be the most important loop, and therefore getting this loop right is a major challenge.

The right framework and the correct orientation of Vh/Vl have a great effect on the quality of H3 predictions. When H3 was modeled on the correct framework, the predictions were an order of magnitude better than when using the homology model. This was demonstrated using native recovery in the RosettaAntibody study, as well as during ‘Step II’ of the Antibody Modeling Assessment, where participants were asked to model H3 using the correct framework.

Journal club (Bernhard Knapp): MMPBSA Binding Free Energy Calculations

This week’s topic of the journal club was Molecular Mechanics Poisson−Boltzmann Surface Area (MMPBSA) binding free energy calculations between ligand and receptor using Molecular Dynamics (MD) simulations. As an example I selected:

David W. Wright, Benjamin A. Hall, Owain A. Kenway, Shantenu Jha, and Peter V. Coveney. Computing Clinically Relevant Binding Free Energies of HIV-1 Protease Inhibitors. J Chem Theory Comput. Mar 11, 2014; 10(3): 1228–1241

The first question is: why do we need such rather complex and computationally expensive approaches if other (e.g. empirical) scoring functions can do similar things? The main challenge is that simple scoring functions often do not work very well for systems they were not calibrated on (e.g. Knapp et al. 2009 (http://www.ncbi.nlm.nih.gov/pubmed/19194661)). The reasons for this are manifold. MD-based approaches can address two major limitations of classical docking/scoring functions:

1) Proteins are not static. Ligand as well as receptor can adopt various slightly different configurations, even for one binding site. Therefore, scoring one ligand configuration against one receptor configuration does not give the whole picture. The first improvement is to consider many different configurations when scoring one ligand pose:


2) A more physics-based scoring function can be more reliable than a simple, run-time-efficient scoring function. On the basis of the MD simulations, a variety of different terms can be deduced. These include:

\Delta G_{\text{bind}} = \Delta E_{MM} + \Delta G_{PB} + \Delta G_{SA} - T\Delta S

– MM stands for Molecular Mechanics. Its internal energy includes bond stretch, bend, and torsion terms. The electrostatic part is calculated using a Coulomb potential, while the van der Waals term is calculated using a Lennard-Jones potential.
– PB stands for Poisson−Boltzmann. It covers the polar solvation part i.e. the electrostatic free energy of solvation.
– SA stands for Surface Area. It covers the non-polar solvation part via a surface tension weighted solvent accessible surface area calculation.
– TS stands for the entropy loss of the system. This term is necessary because the non-polar solvation incorporates an estimate of the entropy changes implicitly but does not account for an entropy change upon receptor/ligand formation in vacuo. This term is calculated on the basis of a normal mode analysis.

If all these terms are calculated for every single frame of the MD simulation, and those values are averaged, an estimate of the binding free energy of the complex is obtained. However, this estimate might not represent the actual mean of the spatial distribution. Therefore, at least 50 replica MD simulations are needed per investigated complex. In this context, a replica means an identically parameterized simulation of the same complex in which only the initial velocities are assigned randomly.
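
Schematically, the bookkeeping looks like this (a sketch of mine with made-up numbers; each frame supplies the four terms in kcal/mol, TS being TΔS):

def mmpbsa_estimate(replicas):
    # Average dG over frames within each replica, then across replicas
    replica_means = []
    for frames in replicas:
        dG = [f["MM"] + f["PB"] + f["SA"] - f["TS"] for f in frames]
        replica_means.append(sum(dG) / len(dG))
    return sum(replica_means) / len(replica_means)

frames = [{"MM": -120.0, "PB": 95.0, "SA": -8.0, "TS": -20.0},
          {"MM": -118.0, "PB": 93.0, "SA": -7.5, "TS": -21.0}]
print(mmpbsa_estimate([frames, frames]))  # estimate in kcal/mol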

On the basis of the described MMPBSA-TS approach, in combination with 50 replicas, the authors achieve a reasonable correlation (0.63) for the 9 FDA-approved HIV-1 protease inhibitors with known experimental binding affinities. If the two largest complexes are excluded, the correlation improves to an excellent value (0.93).

In a current study we are using the same methodology for peptide/MHC interactions. This system is completely different from the protease inhibitor study of Wright et al.: the ligands are peptides and the binding site is a groove consisting of two alpha-helices. The method was applied as-is (without calibration or any kind of training). Preliminary data still shows a high correlation with experimental values for the peptide/MHC system. This indicates that this MMPBSA approach can yield reliable predictions for very different systems without further modification.

Protein Folding: Man vs Machine

In 1996 Garry Kasparov, the reigning world chess champion, played IBM’s Deep Blue, a computer whose sole purpose was to play chess better than any human. After losing the first game, Garry sprang back swiftly, defeating Deep Blue 4-2 over the remaining games. However, his success was short-lived. In a rematch with an updated Deep Blue the following year, the score was 3.5-2.5 to the computer. The media (and IBM) declared this a pivotal moment in history, where a machine had proven itself better than humanity’s champion at a game deemed a highly intellectual pursuit. The outcry was that the age of machines had arrived. Was it true? Should humanity have surrendered to machine overlords at that moment? Obviously the answer is a large and resounding no. However, this competition allows for an insightful comparison between the manner in which humans and computers play chess and think. By comparing the two, we learn the strengths and weaknesses of both parties, from which we can make combined approaches that may exceed either.

Firstly, let’s discuss the manner in which a computer “plays” chess. It simply searches the possible configurations of moves available and picks the most optimal. However, things are not that simple. Consider only the opening sequence: there are 20 possible moves a player can make, so after only a single move by each player there are already 400 possible chess positions. This count grows exponentially fast; after just five single moves in total there are roughly 5 million possibilities. Deep Blue was estimated to be able to analyse around 200 million positions per second. However, since even this is not nearly fast enough to examine all possible games from start to end in a reasonable time scale, computers cannot foresee lines of play which are far in the distance. To overcome this, in the early game the computer uses a reference table developed by grandmasters that lists common openings and the assumed best manner to respond to them. Obviously, these are only assumed to be optimal and have never been completely tested. In short, machines participate through brute force, utilising their intricate ability to perform calculations at high speed to find the best move. However, the search is too large in the initial and end stages of a game to be completely thorough, so a reference table is instead used to “inform” the computer of the correct move at these times.
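
The core of such a search fits in a few lines; below is an illustrative depth-limited negamax (not Deep Blue’s actual code, which added alpha-beta pruning and far more), where `moves`, `apply_move` and `evaluate` are hypothetical game-specific callbacks:

def negamax(position, depth, moves, apply_move, evaluate):
    options = list(moves(position))
    if depth == 0 or not options:
        return evaluate(position)  # leaf: fall back to a static evaluation
    # The best score for the side to move is the worst for the opponent
    return max(-negamax(apply_move(position, m), depth - 1,
                        moves, apply_move, evaluate)
               for m in options)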


While a human can quite easily see that the following board leads to a draw, computers cannot draw the same conclusion without huge effort.

In contrast, human players use far more visual and spatial recognition, alongside both memory and calculation, to pick their moves. Like a computer, a player will analyse a portion of the moves available at any given moment. But since a human cannot compete with the computation speed of a computer, they cannot analyse nearly the same magnitude of moves. Hence, the subset of moves chosen for analysis must contain the most optimal move(s) to compete against the computer’s raw power. This is where the visual and spatial recognition abilities of humans come to bear. Firstly, a human can easily dissect the board into pieces worth considering and those to be ignored. For example, consider a possible move that would result in your queen being exposed and then taken. A human would (normally) conclude this is bad and discard further moves leading from such a play; a computer, however, would explore the resultant board state. One can see how this immediately and drastically reduces the required search. Another human ability is that a player will often be able to see sub-structures within a full set-up that are common in the game and hence can be processed in a known manner. In other words, the game is broken down into fragments which can be processed far more easily and with less computation. Obviously, both of the above techniques rely on prior knowledge of chess to be useful, but they are based upon our human ability to perceive both the substructure of the game and the overall picture with relative ease.

So how does all this chess talk relate to protein folding? In 2010, the Baker group, creators of the ROSETTA protein fold prediction program, produced the protein folding game “Foldit”. In Foldit the general public could attempt to fold proteins for themselves and try to get closer to the native structure than the computer algorithms. Though simplified in presentation compared to academic structural biology, it was hoped that the visual and spatial reasoning abilities of humans, the same ones that differentiated them from machines at chess, would prove useful in protein structure prediction. A key issue within ROSETTA drove this train of thought: it is relatively bad at fully exploring the conformational space. Often it will get stuck in one general configuration and not explore the fold space fully. Furthermore, this is not easily overcome with simulated annealing, due to the sheer scale of the problem. The ability of humans to view the overall picture meant that it should be easier for them to see other possible configurations. As end goals for Foldit, it was hoped that structures that proved unsolvable by current algorithms would be solved by humans, and also that new techniques would emerge as the “moves” employed by players to achieve high scores could be studied.

To make a comparison of the structures produced by Foldit players and ROSETTA viable, the underlying energy “scores” that judge a structure are the same between the programs. It is assumed, though it is not always true, that the better the score the closer you are to the native fold. In addition, Foldit players were also able to use a set of deterministic optimisation tools that would alter the backbone and side chains to the most optimal local configuration for the arrangement the player had made. This meant that players could focus predominantly on altering the overall structure of the protein rather than the fine detail, such as the position of side chains. To make the game as approachable as possible, technical terms were replaced by common analogues, and visual cues were displayed to highlight poor-scoring areas of the protein. For example, clashes between atoms are shown via large spiked red orbs, while the backbone is coloured from green to red depending on how well buried the hydrophobic residues on that segment are. To drive players, gamification elements were also included, such as leader boards and rewarding “fireworks” as graphical effects.

To objectively compare the ability of the player base to that of the ROSETTA algorithm, blind predictions were performed on a set of 10 proteins whose structures were not in the public domain. This was run in a similar manner to CASP, for those familiar with that set-up. The results exemplified the innate human ability of visual and spatial recognition. In 5 of the cases the player base performed significantly better than the ROSETTA program; in 3 of the cases they performed similarly; and in the remaining 2 cases the ROSETTA algorithm performed better, though in both of these the model produced was still extremely far from the native structure. Looking through the cases individually, it was identified that the most crucial element was that players were able to deal with large rearrangements that ROSETTA struggled with, including register shifts and strand swapping. This highlights the ability of humans to view the overall picture and to persevere through “bad scoring patches” to reach a more optimal configuration.


Comparison of Foldit players’ solutions (green) to ROSETTA’s solutions (red) and the native 2KPO protein structure (blue). The players correctly identified a strand swap needed to reach the native form, while this large reconfiguration was not found by ROSETTA.

Since the release of the game and the accompanying paper in 2010, Foldit has received much praise for conveying the field of protein folding in an approachable manner to so many people. In addition, the player base has contributed to science as a whole. In 2011 the players successfully solved the structure of an M-PMV retroviral protein, which had been unobtainable via normal means. Then in 2012, by analysing the common sets of moves employed by the player base, an algorithm was collectively produced that outperforms previously published fold prediction methods. Personally, I think of Foldit as a fun and relatively intuitive game that introduces the core elements of the protein folding problem. As to its scientific merit, I’m unsure how much impact it will continue to have. As Saulo discussed last week, if infinite monkeys have infinite time then Shakespeare will be reproduced. Likewise, if enough people manipulate a protein structure, eventually the best structure will be found. Though who am I to judge; if people find the game fun, then there are far worse pastimes one can have than trying to solve structures. As a finishing note, I would be extremely interested in using Foldit to teach structural biology in the future, though I feel it is overall too simple for a university setting.