Monthly Archives: January 2014

CAPRI, pyDock and the state of protein-protein docking.


We have discussed a paper highlighting the latest progress in the field of protein-protein docking. It covers the results of the participation of the pyDock group in CAPRI. The latest round of the experiment is particularly interesting as it challenged the participants with affinity prediction which is going one step beyond modeling protein-protein complexes.

Role of the group in this CAPRI edition

For those that are not very familiar with CAPRI, one can participate in any of the following roles:

  • Predictors: Participants are given the unbound structures or sequences of one or both interacting partners. Predictors are then supposed to submit a ranked list of modeled complexes which can be achieved by any docking/prediction tools and some human intervention. This tests the current ad hoc capacity to model protein-protein complexes.
  • Servers: Participants are given the unbound structures or sequences of one or both interacting partners that are supposed to be submitted to an automatic server which needs to return the results (ideally) within 24 hours. This tests the automated capacity to model protein-protein models.
  • Scorers: The community contributes sets of poses (unranked) for each target. The scorers are then supposed to re-rank the poses. This is a form of cross-validation of methods since one could expect that the group that samples the poses also performs better scoring. This exercise is supposed to quantify this.

The pyDock group participates in the capacity of Predictors and Scorers. Below we first outline the general docking methodology, followed by the particular implementation of the pyDock group.

General Docking Methodology

Protein-protein docking starts with two interacting structures at input: (ligand (L) and receptor (R)) (if the structures are unavailable a model is created). The docking techniques can be broadly divided into two types: template based and template-free. In template-based docking, one needs either a complex of homologs or one that is sufficiently structurally similar. However since it is estimated that the number of complexes in the PDB only covers ~4% of the presumed total, this approach is not applicable in a great number of cases.

Template-free docking does not rely on pre-solved complexes and it is the main focus of this post. Standard modus-operandi of a template-free protein-protein docker is the following:

  1. Sample N poses.
  2. Re-score top D sampled poses.
  3. Refine the top K poses.

In the first step, possible poses of ligand with respect to the receptor. This is very often achieved by FFT (ZDOCK) or more elaborate methods such as geometric hashing/pose clustering (PatchDock). The possible orientations are ranked by a simplified energy function or a statistical potential. In most cases this step is performed in rigid-body mode (intra-molecular distances are not allowed to change) for the computational efficiency. The number of sampled poses is usually in the thousands (N~ 2k-10k).

The re-scoring set uses a more elaborate energy function/statistical potential to identify more close-to native poses. Notable examples include pyDock and ZRANK. Since such functions are usually more expensive computationally (for instance pyDock calculates the LJ potential, Coulombic electrostatics and desolvation terms), it is more tractable to apply them to a smaller set of (D<<N) of top poses as returned by the rigid-body sampler. Additionally, the final poses might be pruned for redundancy by removing structures which are very similar to each other. One method in particular (ClusPro) owes its success to scoring the rankings based on the numbers of similar decoys per cluster out of a number of top predictions.

The final step constitutes the refinement of the select top K poses (K<<D). Here, flexibility might be introduced so as to account for conformational changes upon binding. Tools used to achieve this are very computationally expensive energy fields such as AMBER or CHARMM. The coordinates of side-chains or even backbones are distorted, for instance using normal mode calculations, until the energy function reaches an energetic minimum via a flavor of gradient descent.

Fig. 1: Breakdown of results of the pyDock group at the latest CAPRI round. The results are presented for each target, in the predictor as well as scorer capacity.

Fig. 1: Breakdown of results of the pyDock group at the latest CAPRI round. The results are presented for each target, in the predictor as well as scorer capacity.

The pyDock group Methodology and Results

The pyDock group follows the pattern outlined above. As a sampler they employ ZDOCK and FTDOCK, both fast rigid-body Fast Fourier Transform-based methods. They use their pyDock function to score the best models. Additionally, they remove redundancy from the final lists of decoys by removing these entries which are too similar to the higher-scoring ones according to ligand rmsd. The final refining step is carried out using TINKER.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

The pipeline outlined above is carried out in most of targets, however there are some exceptions. For some targets, additional docking poses were generated using SwarmDock (T53, T54, T57 and T58) and RotBUS (T46 and T47). In some cases rather than using TINKER at the refinement stage, CHARMM or AMBER were used.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Such ad hoc approaches were employed for almost every target. Available information was employed to the fullest in order to achieve best models. For instance, in cases where good homology structures were available (T47), or it was known which residues appear of importance to the complex, appropriate distance constraints were imposed. The pyDock group achieves better accuracy as predictors rather than scorers when it comes to docking (67% correct models submitted against 57%). They follow a trend wherein predictors usually do better than scorers (See Fig. 1). This trend is however broken at the stage of predicting the water molecules at the interface of T47. Using DOWSER In the predictor capacity they correctly identify 20% contact-mediating water molecules and 43% as scorers.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Overall, the performance of docking is indicating that the field is steadily moving forward towards achieving the goal of modeling complexes using sequence alone. There were some cases in this round where predictor groups started with sequence only and still produced reasonable models of complexes (T47, T50 and T53). In each of these cases one of the partners was an unbound structure and the other was a sequence. The only case where both partners were sequences did not produce any reasonable models. In this case only two groups out of 40 managed to present meaningful solutions.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Furthermore, this CAPRI round was the first to introduce affinity prediction – in targets T55 – T56. The aim was to predict the binding energy changes upon mutations. The predictor designed by the pyDock group achieved a good results on this test case with more in-depth details on the approach found in a related community-wide experiment.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.

Receptor is shown in white. Best ligand as predictors in red, as scorers in blue.


Journal Club: A Novel Insight into Gene Ontology Semantic Similarity

In a past journal club, I presented a paper by Xu et al on their method of quantifying semantic similarity of proteins based on the Gene Ontology. The Gene Ontology (GO) is a directed acyclic graph (DAG, c.f. Figure 1) of terms with which a protein or a gene can be annotated. These terms are designed to describe the process the protein is involved in, what role it fulfils in this process, and where in the cell it is localized. Thus, when comparing the function of two proteins in a quantifiable way, it has become standard to refer back to the GO terms these proteins are annotated with and compare these based on their relationship in the DAG.

Schematic Diagram of a Directed Acyclic Graph (DAG).

Figure 1: Schematic Diagram of a Directed Acyclic Graph (DAG).

As opposed to many methods, which measure the importance of a node (GO-term) in the DAG as its information content given an external database, the method proposed by Xu et al measures semantic similarity independently of external resources, which gives it the appearance of an independent measure. Furthermore, it claims to be good at dealing with badly annotated proteins, which is often a big problem in functional similarity calculations.

The similarity measure is a hybrid between node-based and edge-based methods, and is seemingly inspired by Wang et al’s 2007 paper and Shen et al’s 2010 paper. It is based on what they call “Shortest Semantic Differentiation Distance” or (SSDD), which is calculated over the shortest distance between two GO-terms on the DAG. When comparing the GO-terms A and B, the shortest path is measured by traversing the DAG upwards from node A to the lowest common ancestor of both nodes and down to node B.

The SSDD calculation over the shortest path is based on their so-called semantic Totipotency values assigned to the terms in the DAG that are part of the shortest path. The semantic Totipotency, T, of a term is calculated by:

Semantic Totipotency Measure

Semantic Totipotency Measure

where the weight, ω, is given by:

Weight, ω

Weight, ω

Here, Dst(t) denotest the number of descendents of the term t, and tp denotes the parent term of term t. Thus, the T-value of every node is both an expression of the depth of the DAG in this area and the coverage.

Finally, the SSDD is calculated by:

Semantic Similarity Differentiation Distance

Shortest Semantic Differentiation Distance

And subsequently the similarity of two GO terms is measured by:

Screenshot from 2014-01-27 12:47:46



In their paper Xu et al showed the method to be competitive compared to other methods which compute protein functional similarity by pairwise GO-term comparisons, while also outperforming a common graph-based method in simUI. While these results look promising, the biological interpretability of such a semantic similarity measure remains difficult.

The strongest advantage of the SSDD method proposed was however its alleged invariance to annotation richness of proteins, which was presented as shown in Figure 2 below (Figure 5 in the paper).

Figure 2: The performance of difference methods dealing with sets of proteins with difference annotation richness.

Figure 2: The performance of difference methods dealing with sets of proteins with difference annotation richness.

The results in this figure show that SSDD exhibits only a slight decrease in Pearson Correlation Coefficient to a set of reference similarity values for proteins which are less well annotated. This ability to deal with badly annotated proteins is the true value of the SSDD method proposed by Xu et al. However, this investigation was performed by sets of proteins selected by the authors, and should thus be validated independently to confirm these surprising results.

Journal Club: Ligand placement based on prior structures: the guided ligand-replacement method

Last week I presented a paper by Klei et al. on a new module in the Phenix software suite. This module, entitled Guided Ligand-Replacement (GLR), aims to make it easier to place ligands during the crystallographic model-building process by using homologous models of the ligand-protein complex for the initial placement of the ligand.

In the situation where ligands are being added to a crystallographic protein model, a crystallographer must first build the protein model, identify the difference electron density, and then build the ligand into this density.

The GLR approach is particularly helpful in several cases:

  • In the case of large complex ligands, which have many degrees of freedom, it can take a long time to fit the ligand into the electron density. There may be many different conformations of the ligand that fit the difference electron density to a reasonable degree, and it is the job of the crystallographer to explore these different conformations. They must then identify the true model, or perhaps an ensemble of models in the case where the ligand is mobile or present in different, distinct, binding modes. GLR makes this process easier by using a template from a similar, previously-solved structure. The ligand position and orientation is then transplanted to the new structure to give a starting point for the crystallographer, reducing the tedium in the initial placing the ligand.
  • In the case of a series of related crystal structures, where the same protein structure is determined a number of times, bound to different (but similar) ligands. This is common in the case of structure based drug-design (SBDD), where a compound is developed and elaborated upon to improve binding affinity and specificity to a particular protein. This process generates a series of crystal structures of the protein, bound to a series of ligands, where the binding modes of the ligands are similar in all of the structures. Therefore, using the position and orientation of the ligand from a structure is a good starting point for the placement of further elaborations of that ligand in subsequent structures.
  • In the case of several copies of the protein in the asymmetric unit cell of the crystal. After one copy of the ligand has been built, it can be quickly populated throughout the unit cell, removing the need for the crystallographer to undertake this menial and tedious task.

Program Description:

The required inputs for GLR are standard, as required by any ligand-fitting algorithm, namely:

  • The APO structure of the protein (the structure of the protein without the ligand)
  • A description of the ligand (whether as a SMILES string, or as a cif file etc)
  • An mtz file containing the experimental diffraction data

Overview of the program:

GLR Program Overview

Fig 1. Program Overview.

> Identification of the reference structure

Firstly, the program must determine the reference structure to be used as a template. This can be specified by the user, or GLR can search a variety of sources to find the best template. The template selection process is outlined below. Reference structures are filtered by the protein sequence identity, similarity of the molecular weights of the ligands, and finally by the similarity of the binary chemical fingerprints of the ligands (as calculated by the Tanimoto coefficient).

Template Selection

Fig 2. Reference Structure selection flow diagram.

Little justification is given for these cutoffs, although it is generally accepted that proteins with above 70% sequence identity are highly structurally similar. The Tanimoto coefficient cutoff of 0.7 presumably only serves to remove the possibly of very low scoring matches, as if multiple potential reference structures are available, the highest Tanimoto-scored ligand-match is used. They do not, however, say how they balance the choice in the final stage where they take the ligand with the highest Tanimoto score and resolution.

The method for assigning the binary chemical fingerprints can be found here (small error in link in paper).

> Superposition of Reference and Target structures

Once a reference structure has been selected, GLR uses graph-matching techniques from eLBOW to find the correspondences between atoms in the reference and target ligands. These atomic mappings are used to orient and map the target ligand onto the reference ligand.

Once the reference protein-ligand structure is superposed onto the target protein, these atomic mappings are used to place the target ligand.

The target complex then undergoes a real-space refinement to adjust the newly-placed ligand to the electron density. This allows the parts of the target ligand that differ from the reference ligand to adopt the correct orientation (as they will have been orientated arbitrarily by the graph-matching and superposition algorithms).

> Summary, Problems & Limitations

GLR allows the rapid placement of ligands when a homologous complex is available. This reduces the need for computationally intensive ligand-fitting programs, or for tedious manual building.

For complexes where a homologous complex is available, GLR will be able to quickly provide the crystallographer with a potential placement of the ligand. However, at the moment, GLR does not perform any checks on the validity of the placement. There is no culling of the placed ligands based on their agreement with the electron density, and the decision as to whether to accept the placement is left to the crystallographer.

As the authors recognise in the paper, there is the problem that GLR currently removes any overlapping ligands that are placed by the program. This means that GLR is unable to generate multiple conformations of the target ligand, as all but one will be removed (that which agrees best with the electron density). As such, the crystallographer will still need to check whether the proposed orientation of the ligand is the only conformation present, or whether they must build additional models of the ligand.

As it is, GLR seems to be a useful time-saving tool for crystallographic structure solution. Although it is possible to incorporate the tool into automated pipelines, I feel that it will be mainly used in manual model-building, due to the problems above that require regular checking by the crystallographer.

There are several additions that could be made to overcome the current limits of the program, as identified in the paper. These mainly centre around generating multiple conformations and validating the placed ligands. If implemented, GLR will become a highly useful module for the solution of protein-ligand complexes, especially as the number of structures with ligands in the PDB continues to grow.

Journal Club: Human Germline Antibody Gene Segments Encode Polyspecific Antibodies

This week’s paper by Willis et al. sought to investigate how our limited antibody-encoding gene repertoire has the ability to recognise the unlimited array of antigens. There is a finite number of V, D, and J genes that encode our antibodies, but it still has the capacity to recognise an infinite number of antigens. Simply, the authors’ notion is that an antibody from the germline (via V(D)J recombination; see entry by James) is able to adopt multiple conformations, thus allowing the antibody to bind multiple antigens.

Three antibodies derived from the germline gene 5*51-01, all binding to very different antigens.

Three antibodies derived from the germline gene 5*51-01 bind to very different antigens.

To test this hypothesis, the authors performed a multiple sequence alignment for the amino acid sequence between the mature antibodies and the germline antibody sequence from which the antibodies are derived from. if a single position from ONE mature antibody showed a difference to the germline sequence, it was identified as a ‘variable’ position, and allowed to be changed by Rosetta’s multi-state design (MSD) and single-state design (SSD) protocols.

Pipeline: align mature antibodies (2XWT, 2B1A, 3HMX) to the germline sequence (5-51) , identify 'variable' positions from the alignment, then allow Rosetta to change those residues during design.

Figure 1) from Willis et al., showing the pipeline: align mature antibodies (2XWT, 2B1A, 3HMX) to the germline sequence (5-51) , identify ‘variable’ positions from the alignment, then allow Rosetta to change those residues.

Surprisingly, without any prior information of the germline sequence, the MSD yielded a sequence that was closer to the germline sequence, and the SSD for each mature antibody had retained the mature sequence. In short, this indicated that the germline sequence is a harmonising sequence that can accommodate the conformations of each of the mature antibodies (as proven by MSD), whereas the mature sequence was the lowest energy amino acid sequence for the particular antibody’s conformation (as proven by SSD).

To further demonstrate that the germline sequence is indeed the more ‘flexible’ sequence, the authors then aligned the mature antibodies and determined the deviation in ψ-ϕ angles at each of the variable positions that were used in the Rosetta study. They found that the ψ-ϕ angle deviation in the positions that recovered to the germline residue was much larger than the other variable positions along the antibody. In other words, for the positions that tend to return to the germline amino acid in MSD, the ψ-ϕ angles have a much larger degree of variation compared to the other variable positions, suggesting that the positions that returned to the germline amino acid are prone to lots of movement.

In addition to the many results that corroborate the findings mentioned in this entry, it’s neat that the authors took a ‘backwards’ spin to conventional antibody design. Most antibody design regimes aim to find amino acid(s) that give the antibody more ‘rigidity’, and hence, mature its affinity, but this paper went against the norm to find the most FLEXIBLE antibody (the most likely germline predecessor*). Effectively, they argue that this type of protocol can be exported to extract new antibodies that can bind to multiple antigens, thus increasing the versatility of antibodies as potential therapeutic agents.