# A new method to improve network topological similarity search: applied to fold recognition

Last week I discussed the recent paper by Lhota et al. proposing a network-based similarity search method, applied to the problem of protein fold prediction. Similarity search is the foundation of bioinformatics. It plays a key role in establishing structural, functional and evolutionary relationships between biological sequences. Although the power of the similarity search has increased steadily in recent years, a high percentage of sequences remain uncharacterized in the protein universe. Cumulative evidence suggests that the protein universe is continuous. As a result, conventional sequence homology search methods may be not able to detect novel structural, functional and evolutionary relationships between proteins from weak and noisy sequence signals. To overcome the limitations in existing similarity search methods, the authors propose a new algorithmic framework, Enrichment of Network Topological Similarity (ENTS). While the method is general in scope, in the paper, authors focus exclusively on the protein fold recognition problem.

Fig 1: ENTS pipeline for protein fold prediction.

To initialize ENTS for structure prediction, ENTS builds a structural similarity graph of protein domains (Fig 1). The structural similarity graph is a weighted graph with one node for each structural domain and an edge between two nodes only if their pairwise similarity exceeds a certain threshold. In this article, the structural similarity score is determined by TM align with a threshold of 0.4. Next, some or all the structural domains in the database are labeled with SCOP. Given a query domain sequence and the goal to predict its structure, ENTS first links the query to all nodes in the structural similarity graph. The weights of these new edges are based only on the sequence profile-profile similarity derived from HHSearch. Then random walk with restart (RWR) is applied to perform a probabilistic traversal of the instance graph across all paths leading away from the query, where the probability of choosing an edge will be proportional to its weight. The algorithm will output a ranked list of probabilities of reaching each node in the structural graph from the query, thus potentially uncovering relationships missed by pair-wise comparison methods. ENTS also uses an enrichment analysis step to assess the reliability of the detected relationships, by comparing the mean relationship strength of a SCOP cluster in the structural graph and the query, to that of random clusters.

For testing the method, the authors first constructed a structural graph using 36,003 non-redundant protein domains from the PDB. The query benchmark set consisted of 885 SCOP domains, constructed by randomly selecting each domain from folds spanning at least two super-families. An additional step before prediction on the query set was to remove all domains from the structral graph which were in the same super-family as the query. The method was compared to existing methods such as CNFPred and HHSearch and it’s network approach and enrichment analysis step were found to contribute significantly to the accuracy of fold prediction. While the method seems to be an improvement on existing methods and is a novel use of network-based approaches to fold prediction, the false positive rate is still very high. One way of overcoming this, suggested by the authos is the use of energy-based scoring functions to further prune the list of potential hits returned by the method.

# What does a farm look like? – Evaluating Community Detection Methods

Let’s assume you have a child. Tom is 5 years old, he’s a piece of work. He loves running around the house and the only time you can get a bit of rest is when he’s drawing. Tom loves drawing, so you use that whenever you need a bit of time to sit down. Today is one of those days, you didn’t sleep well and there was trouble at work. However, when you pick up Tom from daycare, he’s excited and full of energy, so you suggest Tom draw a farm which you can put on the wall in your office at work. You sit down and Tom is busy for half an hour. After half an hour Tom proudly presents his work. There are pigs in sties, horses in stables and cows on the meadow. He looks up at you and asks: “Is this a good farm?”. Of course, you say it is an amazing farm.

Only, what if you didn’t know what a farm looked like? You could ask the neighbour’s kid, Emily, to draw a farm and see what the images have in common? You could do this at a whole different scale and start a drawing competition for all the 5-year-olds in the neighbourhood and look which 5-year-olds draw most accurately to see who you should believe what a farm looks like. Clearly we’re not talking about children drawing farms any more. But what if Tom were a functional similarity metric that has just evaluated the results of a community detection algorithm run on a protein interaction network to generate communities?

That was a bit of a jump there. Let me explain. I have spent the last 2 years looking into how protein interaction networks (pins) can be partitioned into functional biological modules. It is widely believed that functional modules are an important scale of organization in humans or any other organisms (c.f. Hartwell et al 1999). These modules consist of proteins which perform the same or similar biological functions via interacting with each other. Thus, there may be a module which contains proteins that interact to e.g. heal wounds.

Finding Functional Modules

We attempt to find these modules by looking at a network which contains data on which proteins interact with which other ones (pins) and then use algorithms to group proteins together based on these interactions. The resulting protein communities contain proteins that interact more with each other than with the rest of the network. So that’s that, right?

…No, sadly it isn’t. The issue is that there are probably tens, if not hundreds of algorithms to find these communities and they don’t agree with each other. Furthermore, the underlying network contains a lot of false interactions and is missing a lot of true interactions. This affects different community detection methods in different ways. So we need a way to evaluate the results these community detection algorithms are presenting. But how do we judge what a good community is when it is exactly this community that we are looking for? How do we know it is “a good farm”, when we don’t know what a farm looks like?

Evaluating Community Detection by Functional Similarity

Luckily we do have some idea of what is “good”. Most proteins have annotations which suggest what biological functions they are involved in. As we are looking for functional modules which group together proteins involved in the same or a similar function, this is exactly along our alleyway! Lucky us :). Right?

Again, you might have expected this one, the answer is “not entirely”. Not only are there a lot of ways to find communities in a network, there are also a huge number of ways to use the above-mentioned functional annotations (GO annotations, www.geneontology.org) to calculate how “functionally similar” two proteins are. Now you might think that all these functional similarity measures use the same functional annotation data, so they should generally agree on what is a “good” and what is a “bad” community. This was my first intuition, so I checked. Below are two graphs which show the results of this test. In both cases I evaluate exactly the same sets of communities which I get from partitioning a pin using Link Clustering (Ahn et al 2010) at different resolutions. The plots show the average functional homogeneity of communities in different community size bands as judged by the Pandey Method on the left (Pandey et al 2008), and the simGIC method on the right (Pesquita et al 2007).

Functional homogeneity plots of a protein interaction network partitioned into communities by Link Clustering at different resolutions. The average functional homogeneity of communities is shown in different community size bands to give the coloured lines. Functional homogeneity is calculated as the average pairwise functional similarity between all protein pairs in a community. In Figure A) the Pandey method is used to calculate functional similarity between two proteins, while in Figure B) the simGIC method is used

You can clearly see that the two plots look different. Now it would be okay if they looked different and still agreed on the most important part: where is my pin partitioned into communities in the best way? To check this I look for maxima in the plots. Figure A) tells me that at a resolution of about 0.2 communities of size 2-35 are on average quite functionally homogeneous. At a slightly lower resolution (approx 0.1) communities of size 36+ look like they are partitioned decently. So depending on the community sizes I’m looking for, I can say which resolution I should go for. However, these peaks don’t show up in Figure B). We clearly need an evaluation of the evaluation metric. Which one should I believe?

Features that are common to both cases (the yellow peak around 0.4 and the magenta maximum around 0.3) might be “true”. But what is to say that the first set of peaks isn’t also important? In our earlier analogy, the neighbour’s daughter Emily has drawn a farm that looks quite different. You’re not about to say Tom’s wrong only because Emily drew a different picture, right? So right about now we need that drawing competition!

Evaluating Evaluation metrics: The drawing competition

Now we’ve stumbled twice already in how to evaluate our results. This time, we should make sure that we can evaluate the different results confidently. The plan is that we let the kids draw something else. Not a farm¸ because we don’t know how that looks like. We ask them to draw something related, like a house. We live in a house, so that should be easier to evaluate. And apparently houses and farms are not too dissimilar, so that the kids that do well in their house drawings, may be the ones that have the best farm drawings as well (or so we hope).

In terms of PINs, we have to create a network with community structure which looks a bit like a PIN. We can then use a community detection algorithm to find the communities and let our kids (functional similarity metrics) go away and evaluate the communities at different resolutions. This time we however know which maxima should show up, as we created the network and therefore know the community structure that should be found.

To create a PIN-similar network is not a mean feat and can be the topic of a whole PhD thesis, so I have focussed on a small, simple network, which is hopefully PIN-similar enough to be a meaningful evaluation for the functional similarity metrics. My network is generated on the basis of functional labels ordered in the ontology below.

A tree defining the relationship between labels 1-15. Nodes are annotated with these labels to generate a network. Labels which share more parent labels are closer related (i.e. 14 and 15 share the parent label 5).

Each node is assigned one or two labels between 6 and 15 randomly without replacement. Then the ontology is traced upwards to get the ancestral label sets associated with each node (i.e. label 15 has the ancestral labels 5 and 1). Edge probabilities are calculated between two nodes based on the number of common labels these nodes have in their ancestral label sets. Finally edges are added based on the edge probabilities to give a network with a density comparable to a PIN.

When this network is clustered into communities and functionally evaluated at different resolutions, the results look like this:

Average functional homogeneity of communities of size 3+ in a PIN-like random graph of 500 nodes. The top left graph (black) is based on label overlaps and is thus ground truth, while the other three are generated by the Pandey method (red), simGIC (blue), and simUI (green). The coloured graphs are generated by using the above-mentioned functional similarity metrics after 33% of the labels from 6-15 and 20% of the labels from 2-5 where “hidden” (reverted to the parent label).

The black graph shows how many labels nodes in the same community have in common at different resolutions. As these labels overlaps were used to generate the edge probabilities, this graph serves as a gold standard. The other three graphs were generated using the functional similarity metrics to be compared. To make the PIN-like network more realistic some labels were hidden, as we don’t always know the exact function of every protein when we evaluate PINs.

Now that we have the results, we just need to see how all the coloured graphs compare to the black ones, or in our analogy: how all the kid’s drawings compare to what our house looks like. But because we are doing science, one drawing competition sadly won’t do. What if one child draws our house very well, but happens to do a bit worse exactly in the parts where the house is similar to a farm. We’d think he’s the guy to listen to, and start thinking a farm looks like a shed?!

What we really need, is more drawing competitions. Lots of them. And this is where I am happy I’m at a computer where running one of these competitions takes maybe a minute. To get a bit more confidence in the results, I ran every simulation 100 times, for 27 different sets of parameters. And the results? Well, that’s the best bit. The results say Tom was drawing a great farm all along. Tom, the kid you’ve seen grow up and been bringing to daycare for years now, he was right. You weren’t lying at all when you said it was a great farm. Only in real life he’s not called Tom, he’s called the Pandey method ;).

# Graphical User Interface for MOSAICS as a Pymol plugin — PymoSAICS

MOSAICS is suite of sampling methods for molecular simulations of motion of nucleic acid and protein structures. It’s applicability has been demonstrated in simulating large ensembles of nucleic acids (Sim 2012, Minary 2014) and proteins.

Starting with a protein/dna/rna structure you would like to examine, the basic modus operandi of MOSAICS is divided into three parts:

• Pick your energy function or a statistical/empirical potentials (e.g. empirical Amber or CHARMM)
• Pick your sampling methodology — e.g. parallel tempering
• Details of simulation: solvent (implicit?), degrees of freedom (cartesian, torsional?) etc.

The energy function defines the energy surface with respect to your degrees of freedom (DoFs) and the sampling methodology is supposed to explore the conformational space along DoFs.

One of the main fortes of MOSAICS lies in the ability of defining hierarchichal natural moves. Defining regions of collective motion introduces experimental knowledge and intuition into the simulations, greatly accelerating sampling. Ability to define such regions  was one of the main reasons to start the development of the graphical user interface (GUI) for MOSAICS — preliminary version can be seen in the Figure below.

Overview of PymoSAICS in its current form.

Since we are developing the GUI as a plugin for Pymol, we called it PymoSAICS. The initial focus of the project is on nucleic acids due to our interests in the structural effects of epigenetic modifications. As demonstrated in the Figure above, the GUI is divided into three main panels:

• Current run — prepare a simulation
• Simulation Manager — manage previous runs, import, export protocols
• Help — That’s just a link to our website!

Users can upload their favorite structure via PymoSAICS or Pymol and play with the available parameters. The GUI is also an ongoing effort to streamline the available protocols in MOSAICS to shield the user from the many parameters that are available but perhaps not relevant to the simulation at hand.

We are currently starting beta tests of the application which (if you don’t mind not getting any support just yet) is available here, Therefore, if you are interested in becoming a tester please let me know, and you will receive a version around Easter (April-ish)! Contact me via konrad.krawczyk at dtc.ox.ac.uk

# A topology-based distance measure for network data

In last week’s group meeting, I introduced our network comparison method (Netdis) and presented some new results that enable the method to be applied to larger networks.

The most tractable methods for network comparison are those which compare at the level of the entire network using statistics that describe global properties, but these statistics are not sensitive enough to be able to reconstruct phylogeny or shed light on evolutionary processes. In contrast, there are several network alignment based methods that compare networks using the properties of the individual proteins (nodes) e.g. local network similarity and/or protein functional or sequence similarity. The aim of these methods is to identify matching proteins/nodes between networks and use these to identify exact or close sub-network matches. These methods are usually computationally intensive and tend to yield an alignment which contains only a relatively small proportion of the network, although this has been alleviated to some extent in more recent methods.

Thus, we do not follow the network alignment paradigm, but instead we take our lead from alignment-free sequence comparison methods that have been used to identify evolutionary relationships. Alignment-free methods based on k-tuple counts (also called k-grams or k-words) have been applied to construct trees from sequence data. A key feature is the standardisation of the counts to separate the signal from the background noise. Inspired by alignment-free sequence comparison we use subgraph counts instead of sequence homology or functional one-to-one matches to compare networks. Our proposed method, Netdis, compares the subgraph content not of the networks themselves but instead of the ensemble of all protein neighbourhoods (ego-networks) in each network, through an averaging many-to-many approach. The comparison between these ensembles is summarised in a Netdis value, which in turn is used as input for phylogenetic tree reconstruction.

Fig1: Effect of sub-sampling egos on the resulting grouping of networks generated by Netdis. Higher Rand index values indicate better fit to non-sampling results.

Extensive tests on simulated and empirical data-sets show that it is not necessary to analyze all possible ego-networks within a network for Netdis to work. Our results indicate that in general, randomly sampling around 10% of egos in each network results in a very similar clustering of networks on average, compared to the tree with 100% sampling (Fig 1). This result has important implications for use-cases where eextremely large graphs are to be compared (e.g > 100,000 nodes). Related to the ego-nework sub-sampling idea is the notion of size-limiting the ego-networks that are to be analyzed by the algorithm. Our tests show that the vast majrity of ego-netowrks in most networks have a relatively low coverage of the overall network. Moreover, by introducing lower-size threshold on the egos, we observe better results on average. Together, this means a limited range of ego-network sizes to be analyzed for each network, which should lead to better statitical properties as the sub-sampling scheme is inspired by bootstrapping.

# Building accurate models of membrane protein structures

Today I gave a talk on my research project when I joined the group. My research focuses on modeling of membrane proteins (MPs). Membrane proteins are the main class of drug targets and their mechanism of function is determined by their 3D structure. Almost 30% of the proteins in the sequenced genomes are membrane proteins. But only ~2% of the experimentally determined structures in the PDB are membranes. Therefore, computational methods have been introduced to deal with this limitation.

Homology modeling is one of the best performing computational methods which gives “accurate” models of proteins. Many homology modeling methods have been developed, with Modeller being one of the best known ones. But these methods have been tested and customised primarily on the soluble proteins. As we know there are main physical difference between the MPS and water soluble proteins. Therefor, to build a homology modeling pipeline for membrane proteins, we need a pipeline which in all its steps the unique environment of the membrane protein is taken into account.

Memoir is a tool for homology-based prediction of membrane protein structure (Figure below). As input memoir takes a target sequence and a template. First, using imembrane the lipid bilayer boundaries are detected on the template. Using this information MP-T, with its membrane specific substitution matrices, aligns the target and template. Then, Medeller is used to build the core model and finally FREAD, a fragment-based loop modeling, is used to fill in the missing loops.

Memoir Pipeline

Memoir methodology builds accurate models but potentially incomplete. Homology modeling often entails a trade-off between the level of accuracy and the level of coverage that can be achieved in predicted models. Therefore we aim to build Memoir 2.0, in which we increase coverage by modelling the missing structural information only if such prediction is sensible. Therefore, to complete the models in the best way we aim at:

• 1-Examine the best ways to maximise FREAD coverage, maintaining accuracy
• 2-Examine the best ab initio loop predictor for membrane proteins

• Fread has two main parameters which contribute to its accuracy and coverage. The nature of the chosen database to look for a loop (i.e. membrane or soluble (mem/sol)) and the choice of the sequence identity (ESSS) cut-off:

• ESSS >= 25: more accurate loop models are built (Hiacc)
• ESSS > 0: more coverage is met but not necessary accurate models (Hicov)

• To test the effect of these parameters on the prediction accuracy and coverage we chose to test set:

• Mem_DS: 280 loops taken from MP X-ray structures.
• Model_DS: 156 loops from homology models of MPs. The loop length in both test ranges from 4 to 17 residues

• The comparisons on both dataset confirm that to achieve the highest accuracy and coverage the FREAD Pipeline should be performed as:

• 1. Hiacc-mem
• 2. Hicov-mem
• 3. Hiacc-sol
• 4. Hicov-sol

• Memoir with the new FREAD Pipeline, called Memoir 2.0, achieves higher coverage in comparison to the original Memoir 1.0.

But there are still loops which are not modeled by FREAD Pipeline. These loops should be modeled using an ab initio method. To test the performance of soulable ab initio loop predictors on the membrane proteins we predicted the loops of the above testset sing six ab initio methods available for download: Loopy, LoopBuilder, Mechano, Rapper, Modeller and Plop.

Comparison between ab initio methods on membrane proteins

Comparisons in the image above shows that:

• FREAD is more accurate but, doesn’t achieve complete coverage.
• Greater coverage is achieved using ab initio predictors.
• Mechano, LoopBuilder and Loopy are the best ab initio predictors.

• We have selected Mechano for Memoir 2.0 because it:

• provides higher coverage than Loopy whilst retaining a similar accuracy.
• is faster than LoopBuilder (Mechano is ~30 min faster on loop length of 12)
• is able to model terminals.

• In memoir 2.0 the C and N terminals of up to 8 residues are built using Mechano. Then, Mechano decoy’s are ranked by their Dfire score , and accepted only if they have exited the membrane. This check improves the average RMSD up to 4Å on DS_280 terminals.

In conclusion, Memoir 2.0 provides higher coverage models while maintaining a reasonable accuracy level. Our comparison results showed that FREAD is significantly more accurate than the ab initio methods. But, greater coverage is achieved using ab initio predictors.Comparison oshows that the top ab initio predictors in terms of accuracy are Mechano, LoopBuilder and Loopy. Similar patterns were also present in the model dataset. We have selected Mechano as it provides higher coverage than Loopy whilst retaining a similar accuracy and is also much faster than LoopBuilder. Mechano also has the advantage that it is able to model terminals. Only loops smaller than 17 residues were considered for modelling since above this threshold the accuracy of predicted loops drops significantly.

# Hypotheses and Perspectives onto de novo protein structure prediction

Before I start with my musings about my work and the topic of my D. Phil thesis, I would like to direct you to a couple of previous entries here on BLOPIG. If you are completely new to the field of protein structure prediction or if you just need to refresh your brain a bit, here are two interesting pieces that may give you a bit of context:

A very long introductory post about protein structure prediction

and

de novo Protein Structure Prediction software: an elegant “monkey with a typewriter”

Brilliant! Now, we are ready to start.

In this OPIG group meeting, I presented some results that were obtained during my long quest to predict protein structures.

Of course, no good science can happen without the postulation of question-driving hypotheses. This is where I will start my scientific rant: the underlying hypotheses that inspired me to inquire, investigate, explore, analyse, and repeat. A process all so familiar to many.

As previously discussed (you did read the previous posts as suggested, didn’t you?), de novo protein structure prediction is a very hard problem. Computational approaches often struggle to search the humongous conformational space efficiently. Who can blame them? The number of possible protein conformations is so astronomically large that it would take MUCH longer than the age of the universe to look at every single possible protein conformation.

If we go back to biology, protein molecules are constantly undergoing folding. More so, they manage to do so efficiently and accurately. How is that possible? And can we use that information to improve our computational methods?

The initial hypothesis we formulated in the course of my degree was the following:

“We [the scientific community] can benefit from better understanding the context under which protein molecules are folding in vivo. We can use biology as a source of inspiration to improve existing methods that perform structure prediction.”

Hence came the idea to look at biology and search for inspiration. [Side note: It is my personal belief that there should be a back and forth process, a communication, between computational methods and biology. Biology can inspire computational methods, which in turn can shed light on biological hypotheses that are hard to validate experimentally]

To direct the search for biological inspiration, it was paramount to understand the limitations of current prediction methods. I have narrowed down the limitations of de novo protein structure prediction approaches to three major issues:

1- The heuristics that rely on sampling the conformational space using fragments extracted from know structures will fail when those fragments do not encompass or correctly describe the right answer.

2- Even when the conformational space is reduced, say, to fragment space, the combinatorial problem persists. The energy landscape is rugged and unrepresentative of the actual in vivo landscape. Heuristics are not sampling the conformational space efficiently.

3- Following from the previous point, the reason why the energy landscape is unrepresentative of the in vivo landscape is due to the inaccuracy of the knowledge-based potentials used in de novo structure prediction.

Obviously, there are other relevant issues with de novo structure prediction. Nonetheless, I only have a limited amount of time for my D.Phil and those are the limitations I decided to focus on.

To counter each of these offsets, we have looked for inspiration in biology.

Our understanding from looking at different protein structures is that several conformational constraints are imposed by alpha-helices and beta-strands. That is a consequence of hydrogen bond formation within secondary structure elements. Unsurprisingly, when looking for fragments that represent the correct structure of a protein, it is much easier to identify good fragments for alpha-helical or beta-strand regions. Loop regions, on the other hand, are much harder to be described correctly by fragments extracted from known structures. We have incorporated this important information into a fragment library generation software in an attempt to address limitation number 1.

We have investigated the applicability of a biological hypothesis, cotranslational protein folding, into a structure prediction context. Cotranslational protein folding is the notion that some proteins begin their folding process as they are being synthesised. We further hypothesise that cotranslational protein folding restricts the conformational space, promoting the formation of energetically-favourable intermediates, thus steering the folding path towards the right conformation. This hypothesis has been tested in order to improve the efficiency of the heuristics used to search the conformational space.

Finally, following the current trend in protein structure prediction, we used evolutionary information to improve our knowledge-based potentials. Many methods now consider correlated mutations to improve their predictions, namely the idea that residues that mutate in a correlated fashion present spatial proximity in a protein structure. Multiple sequence alignments and elegant statistical techniques can be used to identify these correlated mutations. There is a substantial amount of evidence that this correlated evolution can significantly improve the output of structure prediction, leading us one step closer to solving the protein structure prediction problem. Incorporating this evolution-based information into our routine assisted us in addressing the lack of precision of existing energy potentials.

Well, does it work? Surprisingly or not, in some cases it does! We have participated in a blind competition: the Critical Assessment for protein Structure Prediction (CASP). This event is rather unique and it brings together the whole structure prediction community. It also enables the community to gauge at how good we are at predicting protein structures. Working with completely blind predictions, we were able to produce one correct answer, which is a good thing (I guess).

All of this comes together nicely in our biologically inspired pipeline to predict protein structures. I like to think of our computational pipeline as a microscope. We can use it to prod and look at biology. We can tinker with hypotheses, implement potentials and test them, see what is useful for us and what isn’t. It may not be exactly what get the papers published, but the investigative character of our structure prediction pipeline is definitely the favourite aspect of my work. It is the aspect that makes me feel like a scientist.

Protein Structure Prediction, my own metaphorical microscope…

# Improving the accuracy of CDR-H3 structure prediction

When designing an antibody for therapeutic use, knowledge of the structure (in particular the binding site) is a huge advantage. Unfortunately, obtaining even one of these structures experimentally, for example by x-ray crystallisation, is very difficult and time-consuming – researchers have therefore been turning to models.

The ‘framework’ regions of antibodies are well conserved between structures, and therefore homology modelling can be used successfully. However, problems arise when modelling the six loops that make up the antigen binding site – called the complementarity determining regions, or CDRs. For five of these loops, only a small number of conformations have actually been observed, forming a set of structural classes – these are known as canonical structures. The class that a CDR loop belongs to can be predicted from its structure, making the prediction of their structures quite accurate. However, this is not the case for the H3 loop (the third CDR of the heavy chain) – there is a much larger structural diversity, making H3 structure prediction a challenging problem.

Antibody structure, showing the six CDR loops that make up the antigen binding site. The H3 loop is found in the centre of the binding site, shown in pink. PDB entry 1IGT.

H3 structure modelling can be considered as a specific case of general protein loop modelling. Starting with the sequence of the loop, and the structure of the remaining parts of the protein, there are three stages in a loop modelling algorithm: conformational sampling, the filtering out of physically unlikely structures, and ranking. There are two types of loop modelling algorithm, which differ in the way they perform the conformational sampling step: knowledge-based methods, and ab initio methods. Knowledge-based methods use databases of known structures to produce loop conformations, while ab initio methods do this computationally, without knowledge of existing structures. My research involves the testing and development of these loop modelling algorithms, with the aim of improving the standard of H3 structure prediction.

MECHANO is an ab initio algorithm that we have developed specifically for H3 loop prediction. Loops are built computationally, by adding residues sequentially onto one of the anchors. For each residue, φ/ψ dihedral angles are chosen from a distribution at random – the distributions used by MECHANO are residue-specific, and are a combination of general loop data and H3 loop data. Loops conformations are closed using a modified cyclic coordinate descent algorithm (CCD), where the dihedrals of each residue are changed, one at a time, to minimise the distance between the free end of the loop and its anchor point, whilst keeping the dihedral angles in the allowed regions of the Ramachandran plot. I have tested MECHANO on the same set of targets as FREAD, generating 5000 loop conformations per target: the average best prediction RMSD was 2.1 Å, and the results showed a clear length dependence – this is expected, since the conformational space to explore becomes larger as the number of residues increases. Even though the average best prediction RMSD is better than that of FREAD, only one of the best RMSDs produced by MECHANO was sub-angstrom, compared to 14 for FREAD. Since the MECHANO algorithm does not depend on previously observed structures, predictions were made for all targets (i.e. coverage = 100%).

My current work is focused upon developing a ‘hybrid’ method, which combines elements of the FREAD and MECHANO algorithms. In this way, we hope to make predictions with the accuracy that can be achieved by FREAD, whilst maintaining 100% coverage. In its current form, the hybrid method, when tested on the 55-loop dataset from before, produces an average best prediction RMSD of 1.68 Å, with 16 targets having a best RMSD of below 1 Å – a very promising result! However, possibly the most difficult part of loop prediction is the ranking of the generated loop structures; i.e. choosing the conformation that is closest to the native. This is therefore my next challenge!

# Predicting Antibody Affinities – Progress?

Antibodies are an invaluable class of therapeutic molecules — they have the capacity to bind any molecule (Martin and Thornton, 1996), and this comes from an antibody’s remarkable diversity (Georgiou et al., 2014). In particular, antibodies can bind their targets with high affinity, and most drug design campaigns seek to ‘mature’ an antibody, i.e. increase the antibody’s affinity for its target. However, the process of antibody maturation is, in experimental terms, time-consuming and expensive — if we had 6 CDRs (as in a typical antibody), with 10 residues each, and if you can have any of the 20 amino acids in the CDR positions, there are 20^60 mutants to test (and this is before considering any double or triple mutations!)

So hold on, what is affinity exactly? Affinity represents the strength of binding, and it’s calculated as either a ratio of concentrations, or as a ratio of rate constants, i.e.In the simplest affinity maturation protocol, three steps are compulsory:

1. Mutate the antibody’s structure correctly
2. Assess the impact of mutation on KD
3. Select and repeat.

For the past year, we have centralised around part 2 — affinity prediction. This is a fundamental aspect of the affinity maturation pipeline in order to rationalise ‘good’ and ‘bad’ mutations in the context of maturing an antibody. We developed a statistical potential, CAPTAIN; essentially the idea is to gather contact statistics that are represented in antibody-antigen complexes, and use this information to predict affinities.

But why use contact information? Does it provide anything useful? Based on analyses of the interfaces of antibody-antigen complexes in comparison to general protein-protein interfaces, we definitely see that antibodies rely on a different binding mode compared to general protein-protein complexes, and other studies have confirmed this notion (Ramaraj et al., 2012; Kunik and Ofran, 2013; Krawczyk et al., 2013).

For our methodology, we trained on general protein-protein complexes (as most scoring functions do!) and specifically on antibody-protein complexes from the PDB. For our test set of antibody-protein complexes, we outperformed 16 other published methods, though for our antibody-peptide test set, we were one of the worst performers. We found that other published methods predict antibody-protein affinities poorly, though they make better predictions for antibody-peptide binding affinities. Ultimately, we achieve stronger performance as we use a more appropriate training set (antibody-antigen complexes) for the problem in hand (predicting antibody-antigen affinities). Our correlations were by no means ideal, and we believe that there are other aspects of antibody structures that must be used for improving affinity prediction, such as conformational entropy (Haidar et al., 2013) and VH-VL domain orientation (Dunbar et al., 2013; Fera et al., 2014).

What’s clear though, is that using the right knowledge base is key to improving predictions for solving greater problems like affinity maturation. At present, most scoring functions are trained on general proteins, but this ‘one-fits-all’ approach has been subject to criticism (Ross et al., 2013). Our work supports the idea that scoring functions should be tailored specifically for the problem in hand.

# Looking for a null model of PPI ego-networks

Protein-protein interaction (PPI) networks describe how proteins are connected to one another in terms of physical interactions. They can be used to aid our understanding of the individual roles of proteins (Sarajli ́c et al., 2013), the co-functioning properties of sets of proteins (West et al., 2013) and even the operation of the complete system (Janowski et al., 2014).

Different approaches have been proposed to analyse, describe and predict these PPI networks, such as network summary statistics, clustering methods, random graph models and machine learning methods. However, despite the large biological, computational and statistical interest in PPI net- works, current models insufficiently describe PPI networks (Winterbach et al., 2013; Ali et al., 2014; Rito et al., 2010). It is commonly accepted that proteins perform functions usually in conjunction with other proteins, forming a functional module (Lewis et al., 2010). Hence local structure is found to be important in protein-protein interaction networks (Planas-Iglesias et al., 2013).

Here we address the modelling problem locally by modelling the ego-networks of PPI networks by means of random graph models.

Random graph models

Loosely speaking, a random graph model is a set of rules that define an edge generation process among a set of nodes. Usually this set of rules relate to particular characteristics that are embedded in the network generation process. Here are three examples of such characteristics:

•  Independence  (each edge has a probability $p$ of being present).
• Preferential attachment (nodes form edges with highly interacting nodes).
• Space-based interactions (an edge is present between two nodes if the distance between them small).

A classical model representing an independence structure is the ER(nv,p) model. This is a random graph on nv nodes, and where edges are present independently at random with probability p.

Now, the preferential attachment characteristic can be illustrated by the Chung-Lu model. That is, given an expected sequence of weights $\{d_1,d_2,...,d_{n_v}\}$. The probability of obtaining an edge between nodes i and j is given by  $P((i,j)\in E)=d_id_j / \sum_j d_j.$

Finally, a model representing a spaced based network generation process could be the Geometric model. Here, nodes are placed uniformly at random in a d-dimensional square $[0,1]^d$. Now, given a radius or threshold distance (r), edges are drawn among nodes $v_i,\,v_j$ $i\neq j$  if $d(v_i,v_j)\leq r$.

From the latter figures it can be seen that different models often lead to different network structures. Thus, although standard random graph models do not reproduce a sufficiently similar network structure to the one of PPI networks, they could still be good approximations for different local regions in a PPI network.

Finding a null model for PPI ego-networks

Our approach consist in finding local regions of the PPI networks that could be represented well by the random graph models. To do so, we propose to extract all 2-step ego-networks and classifying them according to some simple characteristic, network density for example.

Now, once the ego-networks of the PPI network have been extracted and binned according to their network density (ego-density). We assess the fit of the model to the PPI networks by comparing each bin of PPI ego-networks to the ego-networks extracted from a random graph model. This comparison is made by comparing the difference in the resulting number of subgraph counts, triangles for example, in each of the ego-networks within each bin.

The following figure illustrates the underlying idea of this procedure:

Following this approach we aim to find bins for which, possibly different models, reproduce similar subgraph counts as the ones obtained in the PPI ego-networks. However we expect to fin bins for which none of the standard models fit.

# OOMMPPAA: A tool to aid directed synthesis by the combined analysis of activity and structural data

Motivation

Recently I published a paper on OOMMPPAA, my 3D matched-molecular pair tool to aid drug discovery. The tool is available to try here, download here and read about here. OOMMPPAA aims to tackle the big-data problem in drug discovery – where X-ray structures and activity data have become increasingly abundant. Below I will explain what a 3D MMP is.

What are MMPs?

MMPs are two compounds that are identical apart from one structural change, as shown in the example below. Across a set of a thousand compounds, tens of thousands of MMPs can be found. Each pair can be represented as a transformation. Below would be Br >> OH. Within the dataset possibly hundreds of Br >> OH will exist.

An example of an MMP

Each of these transformations will also be associated with a change in a measurable compound property, (solubility, acidity, binding energy, number of bromine atoms…). Each general transformation (e.g. Br>>OH) therefore would have a distribution of values for a given property. From this distribution we can infer the likely effect of making this change on an unmeasured compound. From all the distributions (for all the transformations) we can then predict the most likely transformation to improve the property we are interested in.

The issue with MMPs

Until recently the MMP approach had been of limited use for predicting compound binding energies. This is for two core reasons.

1) Most distributions would be pretty well normally distributed around zero. Binding is complex and the same change can have both positive and negative effects.

2) Those distributions that are overwhelmingly positive, by definition, produce increased binding against many proteins. A key aim of drug discovery is to produce selective compounds that increase binding energy only for the protein you are interested in. So increasing binding energy like that is not overall very useful.

3D MMPs to the rescue

3D MMPs aim to resolve these issues and allow MMP analysis to be applied to binding energy predictions. 3D MMPs use structural data to place the transformations in the context of the protein. One method, VAMMPIRE, asks what is the overall effect of Br>>OH when it is near to a Leucine, Tyrosine and Tryptophan (for example). In this way selective changes can be found.

Another method by BMS aggregates these changes across a target class, in their cases kinases. From this they can ask questions like, “Is it overall beneficial to add a cyclo-proply amide in this region of the kinase binding site”.

What does my tool do differently?

OOMMPPAA differs from the above in two core ways. First, OOMMPPAA uses a pharmacophore abstraction to analyse changes between molecules. This is an effective way of increasing the number of observations for each transition. Secondly OOMMPPAA does not aggregate activity changes into general trends but considers positive and negative activity changes separately. We show in the paper that this allows for more nuanced analysis of confounding factors in the available data.