Improving the precision of local community detection

Communities are everywhere. You can find groups of connected people on Facebook who went to university together, communities of people on youtube that like the same band, sets of products on amazon that tend to be purchased together, or modules of proteins that act in the same biological process in the human body. With the amount of data we can collect on these networks (facebook friendship, protein interactions, etc), you would think we can easily recover these groups. You would think wrong…

There are a lot of methods out there to find communities in networks. As a paper that compares community detection methods by Lancichinetti and Fortunato put it: “A thorough analysis of all existing methods would require years of work (during which many new methods would appear), and we do not believe it is feasible”. Thus, clearly we have not found the answer.

So what do people generally do to find these communities? Well… very different things. To explain a few of the approaches, it helps to break them down into broad categories such as global and local methods.


Global Methods

Global community detection methods generally aim to optimized an objective function over the entire network. For example, I could decide that I will score the network partition based on a function I will call “greatness”, G. I will define the greatness of a community by the sum of the number of nodes each node is connected with that are in the community (the sum of the node in-degree). Or simply:

Greatness Equation

Here, A_ij is the adjacency matrix that contains a 1 where nodes i and j interact, and 0 otherwise. The sigmas in the equation correspond to the community identifiers of nodes i and j respectively, and thus the delta function is only 1 when sigma_i and sigma_j are equal, meaning nodes i and j are in the same community, and 0 otherwise.

If you now come to the conclusion that this is a pretty stupid thing to define, then you’ve understood the concept of a global method (alternatively… get rid of that negativity!). A global community detection method would now look how to partition the network so that we can maximize G. In the case of greatness defined here, everything will be put into the same community, as there’s no penalty for putting two nodes together that shouldn’t belong together. Adding a further node to a community can never decrease the greatness score!

If you want a real overview over these and other methods, check out this 100-page review article.


Local Methods

In contrast to global methods that attempt to gather information on the whole network, local community detection methods dive into the network to try to optimize the community partitioning. This means on the one hand they can be very fast at assigning nodes to communities, as they take into account less information. However, on the other hand using less information in this assignment means they can be less precise.

An example of a local method is link clustering, which I described in an earlier post. Link clustering assesses the similarity of two edges that share a node, rather than comparing the nodes themselves. The similarity of these edges is calculated based on how similar the surroundings of the nodes on the endpoints are. Using this score, all edges that are more similar than a cutoff value, S, can be clustered together to define the edge-communities at resolution S. This partitioning translates to nodes by taking the nodes that the edges in the edge-communities connect.


Improving local community detection

Link clustering is a pretty nice and straightforward example of how you can successfully find communities. However, the authors admit it works better on dense than on sparse networks. This is an expected side-effect of taking into account less information to be faster as mentioned above. So you can improve it with a pretty simple idea: how much more information can I take into account without basically looking at the whole network?

Local community detection methods such as link clustering are generally extended by including more information. One approach is to include information other than network topology in the local clustering process (like this). This approach requires information to be present apart from the network. In my work, I like using orthogonal data sets to validate community detection, rather than integrate it into the process and be stuck without enough of an idea if it actually worked. So, instead I’m trying to include more of the local surroundings of the node to increase the sensitivity of the similarity measure between edges. At some point, this will blow up the complexity of the problem and turn it into a global method. So I’m now walking a line between precision and speed. A line that everyone always seems to be walking. On this line, you can’t really win… but maybe it is possible to optimize the trade-off to suit my purposes. And besides… longer algorithm run times give me more time for lunch breaks ;).

Conformational Variation of Protein Loops

Something many structural biologists (including us here in OPIG!) are guilty of is treating proteins as static, rigid structures. In reality, proteins are dynamic molecules, transitioning constantly from  conformation to conformation. Proteins are therefore more accurately represented by an ensemble of structures instead of just one.

In my research, I focus on loop structures, the regions of a protein that connect elements of secondary structure (α-helices and β-sheets). There are many examples in the PDB of proteins with identical sequences, but whose loops have different structures. In many cases, a protein’s function depends on the ability of its loops to adopt different conformations. For example, triosephosphate isomerase, which is an important enzyme in the glycolysis pathway, changes conformation upon ligand binding, shielding the active site from solvent and stabilising the intermediate compound so that catalysis can occur efficiently. Conformational variability helps triosephosphate isomerase to be what is known as a ‘perfect enzyme’; catalysis is limited only by the diffusion rate of the substrate.

Structure of the triosephosphate isomerase enzyme. When the substrate binds, a loop changes from an ‘open’ conformation (pink, PDB entry 1TPD) to a ‘closed’ one (green, 1TRD), which prevents solvent access to the active site and stabilises the intermediate compound of the reaction.

Structure of the triosephosphate isomerase enzyme. When the substrate binds, a loop changes from an ‘open’ conformation (pink, PDB entry 1TPD) to a ‘closed’ one (green, 1TRD), which prevents solvent access to the active site and stabilises the intermediate compound of the reaction.

An interesting example, especially for some of us here at OPIG, is the antibody SPE7. SPE7 is multispecific, meaning it is able to bind to multiple unrelated antigens. It achieves this through conformational diversity. Four binding site conformations have been found, two of which can be observed in its unbound state in equilibrium – one with a flat binding site, and another with a deep, narrow binding site [1].

An antibody that exists as two different structures in equilibrium - one with a shallow binding site (left, blue, PDB code 1OAQ) and one with a deep, narrow cleft (right, green, PDB 1OCW). Complementary determining regions are coloured in each case.

SPE7; an antibody that exists as two different structures in equilibrium – one with a shallow binding site (left, blue, PDB code 1OAQ) and one with a deep, narrow cleft (right, green, PDB 1OCW). Complementary determining regions are coloured in each case.

So when you’re dealing with crystal structures, beware! X-ray structures are averages – each atom position is an average of its position across all unit cells. In addition, thanks to factors such as crystal packing, the conformation that we see may not be representative of the protein in solution. The examples above demonstrate that the sequence/structure relationship is not as clear cut as we are often lead to believe. It is important to consider dynamics and conformational diversity, which may be required for protein function. Always bear in mind that the static appearance of an X-ray structure is not the reality!

[1] James, L C, Roversi, P and Tawfik, D S. Antibody Multispecificity Mediated by Conformational Diversity. Science (2003), 299, 1362-1367.

“Identifying Allosteric Hotspots with Dynamics”: Using molecular simulation to find critical sites for the functional motions in proteins

Allosteric (allo-(“other”) + steric (repulsion of atoms due to closeness or arrangement)) sites regulate protein function from a position other than the active site or binding site. Consider the latch on a pair of gardening scissors (Figure 1): depending on the position of the latch (allosteric site) the blades are prevented from cutting things at the other end (active site).


Figure 1 Allostery explained: A safety latch in gardening scissors.

Due to the non-trivial positions of allosteric sites in proteins their identification has been challenging. Selected well characterised systems such as GPCRs have known allosteric sites that are being used as targets in drug development. However, large scale identification of allosteric sites across the Protein Data Base (PDB) has not yet been feasible, partly because of the lack of tools.

To tackle this problem the Gerstein Lab developed a computational protocol based on various molecular simulations and network methods to find allosteric hotspots in proteins across the PDB. They introduce two different pipelines; one for identifying allosteric residues on the surface (surface-critical) and one for buried residues (interior-critical).

For the search of exterior-critical residues they us MC simulations to repeatedly probe the surface of the protein with short Monte Carlo (MC) with a short peptide. Based on hard spheres and simple energy calculations this method seems to be an efficient way of detecting possible binding pockets. Once the binding pockets have been found, the collective motions of the structure are simulated using an elastic mass-and-spring network (an anisotropic network model [ANM]). Binding pockets that undergo significant deformation during these simulations are considered to be surface-critical.

For interior-critical residues they start by weighting residue-residue contacts on the basis of collective movement. Communities within the weighted network are then identified and the residues with the highest betweenness interactions between communities are chosen as interior-critical residues. Thus, interior-critical residues have the highest information flow between two densely inter-connected groups of residues.

The protocol as been implemented in STRESS (STRucturally identified ESSential residues) and is freely available at


Comp Chem Kitchen

I recently started  “Comp Chem Kitchen” with Richard Cooper and Rob Paton in the Department of Chemistry here in Oxford. It’s a regular forum and seminar series  for molecular geeks and hackers, in the original, untarnished sense of the word: people using and developing computational methods to tackle problems in chemistry, biochemistry and drug discovery. Our hope is that we will share best practices, even code snippets and software tools, and avoid re-inventing wheels.

In addition to local researchers, we invite speakers from industry and non-profits from time to time, and occasionally organize software demos and tutorials.

We also provide refreshments including free beer. (We are grateful to Prof. Phil Biggin and the MRC Proximity to Discovery Fund for offering to support CCK.)


Our first meeting, CCK-1,  was held in the Abbot’s Kitchen on May 24, 2016, at 5 pm, and was a great success—standing room only, in fact! The Abbot’s Kitchen—originally a laboratory—is a beautiful stone building built in 1860 in the Victorian Gothic style, alongside the Natural History Museum, at a time when Chemistry was first recognized as a discipline.

Abbot's Kitchen, Oxford, Watercolor, Detailed, 512x683We heard a fascinating talk from Jerome Wicker from the Department of Chemistry who spoke about “Machine learning for classification of solid form data extracted from CSD and ZINC”, and described a method that could successfully discriminate (~80%) whether a small molecule would crystallize or not. The software tools he discussed included RDKit, CSD, and scikit-learn. There were also two lightning talks, each 5 minutes long, one from OPIG member Hannah Patel,  from the Department of Statistics, on “Novelty Score: Prioritising compounds that potentially form novel protein-ligand interactions and novel scaffolds using an interaction centric approach”, who briefly described her Django-based web interface to her RDKit-based tool to analyse structures of ligands bound to proteins and help guide future medicinal chemistry to find novel compounds. We also had a talk from Dr Michael Charlton from InhibOx spoke about “Antibacterial Drug Discovery and Machine Learning”.


Our next Comp Chem Kitchen, CCK-2, will be held next Tuesday (June 14th, 2016), and you can register free for CCK-2 here.

We will have talks from:


Hope to see you there!  (Did I say free beer?)

RDKit User Group Meeting 2016

If you’re interested in small molecules and cheminformatics, you might like to know that registration for the 2016 RDKit User Group Meeting is now open. The meeting will be held from the 26th-28th of October in Basel, Switzerland. From their announcement:

This year’s RDKit User Group Meeting will take place from 26-28 October at the Novartis Campus in Basel, Switzerland and is co-organized by people from Roche and Novartis.

Registration for the RDKit UGM is free:

The previous years’ format seemed to work pretty well and the feedback was positive, so we will stick to the same format this year:

Days 1 and 2: Talks, lightning talks, roundtable(s), discussion, and talktorials. For those who haven’t attended before, talktorials are somewhere between a talk and a tutorial, they cover something interesting done with the RDKit and include the code used to do the work. During the presentation you’ll give an overview of what you did and also show the pieces of the code that are central to the work. The idea is to mix the science up with the tutorial aspects.

Day 3 will be a sprint: those who choose to stay will spend an intense day working in small groups to produce useful artifacts: new bits of code, KNIME nodes, KNIME workflows, tutorials, documentation, IPython notebooks, etc. We will once again try to structure this a bit by collecting a bunch of ideas for things to work on in advance.

There will also be, of course, social activities.

Looks like fun!

Structure prediction through sequence physical properties

Today, protein structure is mainly predicted by aligning the unknown amino acid sequence against all sequences for which we already know the physical structure. Whilst sequences differing in length can be readily catered for by inserting or deleting (with or without affine gap penalties) the odd amino acid, there will frequently be cases where there are mutations. To compensate for this, the likes of the BLOSUM matrix is used to score the likelihood of one amino acid having mutated into another. For example, a hydrophobic residue is more likely to be swapped for something similar, than it is to be replaced with something strongly hydrophilic.

Whilst this is an entirely reasonable basis, there are many other physical property factors which which can be considered. In fact, the Amino Acid Index currently lists 544 physicochemical and biochemical properties of amino acids. A paper by Yi He et al. PNAS 2015;112:5029-5032 recently made use of a subset of these properties to predict structure. Their work shown below shows target T0797 (B) from CASP 11 compared with a purely physical structure predicted using their method (A) and the PSI BLAST candidate for the same sequence (C).

em structure phys properties

Even though their structure only had three residues in common with the target sequence, it is plainly more similar than the PSI BLAST attempt. The RMSD between structures A and B is also reported as being 0.73 Å, whilst PSI BLAST returns an RMSD of 2.09 Å.

Visualising Biological Data, Pt. 2

Here’s a little quick round-up of some of the tools/algorithms that I’ve seen in VIZBI, which I believe can be useful for many. For more details, I strongly advise you check out the posters page ( There were a few that I would’ve liked to re-visit, but the webapps weren’t available (e.g. MeshCloud from the Human Genome Center, Tokyo), so maybe I’ll come back with a part 3. Here are my top five:

1. Autodesk’s Protein Viewer* (shout-outs to @_merrywang on Twitter)
As a structural bioinformatician, I’m going to be really biased here, and say that Autodesk’s Molecule viewer was the best tool that was showcased in the conference. It combines not only the capacity to visualise millions of molecules from the PDB (or your own PDB files), it also allows annotation and sharing, effectively, “snapshots” of your workspace for collaboration (see this if you want to know what I mean). AND it’s free! It’s not the fastest viewer on the planet, nor the easiest thing to use, but it is effective.

2. Vectorbase
Not related to protein structures, but a really interesting visualisation that shows information on, for example, insecticide resistance. With mosquitoes being such a huge part of today’s news, this kind of information is vital for fighting and understanding the distribution of insects across the globe.

3. Phandango
This is a genome browser which, from a one-man effort, could be a game-changer. The UI needs a little bit of work I think, but otherwise, a really valuable tool for crunching lots of genomic data in a quick fashion.

4. i-PV Circos
This is a neat circular browser that helps users view protein sequences in a circularised format. With this visualisation format becoming more popular as the days go by, I think this has the potential to be a leader in the field. At the moment the website’s a bit dark and not the most user-friendly, but some of the core functionality (e.g. highlighting residues and association of domains) is a real plus!

5. Storyline visualisation
Possibly my favourite/eye-opener tool from the entire conference. Storyline visualisation helps users understand how things progress in realtime — this has been used for movie plot data (e.g. Star Wars character and plot progression) but the general concept can be useful for biological phenomena – for example, how do cells in diseased states progress over time? How does it compare to healthy states? Can we also monitor protein dynamics using a similar concept? I think the fact that it gives a very intuitive, big-picture overview of the micro-scale dynamics was the reason why I’ve been incredibly interested in Kwan-Liu Ma’s work, and I recommend checking out his website/publications list to grab insight on improving data visualisation (in particular, network visualisation when you want to avoid hairballs!)

The list isn’t ranked in any way, and do check these out! There were other tools I would’ve really liked to review (e.g. Minardo, made by David Ma @frostickle on Twitter), but I suppose I can go on and on. At the end of the day, visualisation tools like these are meant to be quick, and help us to not only EXPLORE our data, but to EXPLAIN it too. I think we’re incredibly fortunate to have some amazing minds out there who are willing to not only create these tools, but also make them available for all.

Convergent affinity maturation.

Antibodies are the first line of defense of our organisms against noxious substances. They are the proteins which we ‘train’ to recognize noxious substances when we get immunized. Therefore understanding the immune response after being presented with an antigen is instrumental in developing novel vaccines.

One hypothesis relating to immune response to an antigen is that different organisms are likely to raise similar or even identical antibodies against the same antigen. Testing this hypothesis has become more realistic recently with the advent of Next Generation Sequencing technologies (NGS). Using NGS techniques it is possible to interrogate the sequential makeup of a large set of B-cells.

Such a study was conducted not that long time ago by Trueck et al. They have analysed antibody repertoires of five individuals, pre and post immunization to check if the immune systems converged on similar antibody sequences. The five individuals were immunized with a conjugate vaccine of HiB, MenC and TT. The antibodies were sequenced from cells extracted pre-vaccination and seven days after vaccination.

Firstly, the antibody repertoire appeared to reflect the fact that an organism was mounting an immune response as the clonality post vaccination was higher than before vaccination (more cells producing similar antibodies). Secondly, authors focused on identifying sequences from the public repertoire — those antibodies that are shared between individuals. This analysis focused on the CDR3 only, of which 47 were shared between at least two of the five individuals. Quite a large proportion of those sequences were known to be specific towards HiB and the enrichment of these was muhch higher in the post-vaccination sample. Only one sequence in this set was known previously to target TT. Nevertheless, relaxing the sequence similarity condition, a lot of sequences related to those known to be TT-specific were found among the five individuals. Most importantly, the number of such sequences was much higher in the post-vaccination samples, indicating that these might indeed have been raised in response to TT stimulation. The same was not true for MenC as hardly any sequences related to this antigen were found in the immune response of the five individuals.

Therefore, authors claim that looking at the enrichment of such shared sequences can be an indicator of the effectiveness of the immune response. They correlate statistics coming from looking at the number of shared sequences which appear to have moderate correlation to the antibody avidity data (even though p-values in some cases are quite high). This indicates that even in such a small set of individuals, antibodies are capable of converging on similar solutions. This might provide clues as to the characteristics antibodies that recognize specific antigens and thus facilitate novel vaccine design.




Is “fragment-based” still the way forward in template-free protein structure prediction?

Out of the many questions surrounding the notion that you can predict a protein’s structure from its sequence, there is one in particular that I decided to tackle during last group meeting.

Protein structure prediction is a hard problem (do I sound repetitive?). One of the many cop outs employed by the structure prediction community is the idea that you can break down known structures into fragments and use these protein pieces to perform predictions. This is known as fragment-assembly or fragment-based template-free protein structure prediction.

As absurd as the idea may seem, there is robust evidence that suggests that this is actually a viable strategy. There is a notion that the fragment space is complete; you can reconstruct the backbone of any known structure based on the torsion angles of fragments from other structures. In less technical jargon, you can effectively use fragments and combine them to re-create any of the protein structures that we know and to a fairly acceptable level of precision.

So, technically, it is possible to predict a protein structure using fragments from other structures. In practice, you are still left with the problem of choosing the right fragments to model your sequence of interest. How easy do you think that is?

We can look at this question in light of observations that were made back in the early 80s. Kabsch and Sander reported that two protein fragments having exactly the same sequence can present completely different structures [1]. This complies with the notion that global properties can affect and even define local structure, which in turn suggests that selecting the right fragments to assemble a structure is not necessarily a straightforward process.

The starting point for protein structure prediction is a sequence. Since we are talking about template-free protein structure prediction, it is safe to assume that there is no good global sequence match to your target with a known structure (otherwise you would use that match/structure as a template). Hence, fragment selection is restricted to local sequence similarity, which, as suggested in the previous paragraph, is not necessarily ideal.

On the other hand, we are becoming increasingly more accurate in inferring one-dimensional properties from a protein’s sequence. These properties can and often are used to enhance our fragment-selection capabilities. Yet, even using the state-of-the-art in secondary structure and torsion angle prediction, fragment selection is still fairly imprecise.

During group meeting I highlighted a possible contrast between practical fragment space and general (or possible) fragment space. My premise is simple.  I define practical fragment space as the fragments that we can accurately select from the possible fragment space to model protein structures. In my opinion, it would be extremely interesting to quantify the difference between the two. This would answer the fundamental question of how useful fragment-assembly actually is. More importantly, it would help the community make an educated decision in regards to whether template-free structure prediction strategies should shift from fragment-based to ones based on distance constraints, an approach that is gaining popularity due to the success of contact predictions.

I am very keen to investigate this further. Maybe for my next blog post, we will have an answer! Stay tuned.

[1] Kabsch, Wolfgang, and Christian Sander. “On the use of sequence homologies to predict protein structure: identical pentapeptides can have completely different conformations.” Proceedings of the National Academy of Sciences  81.4 (1984): 1075­1078.

Journal Club: “Discriminative Chemical Patterns: Automatic and Interactive Design”

For Journal Club this week I decided to discuss the following paper by M. Rarey et al., which describes a method of using SMARTS patterns to discriminate between two sets of molecules. Link to paper here.

Given two sets of molecules can one generate a pattern that discriminates between two sets? This relates to a key question in drug design: can we predict whether molecules bind or not given a set of binders and a set of non-binders. The method is of particular interest because it makes use of data available, unlike conventional methods. However, for this technique to work, the correct molecular classification is required to discriminate between the two sets of molecules.

Originally molecules were classified using physiochemical properties for example, molecular weight or log P. However these classifications are too general and do not encompass enough molecular detail for accurate discrimination. An alternative is to using topological fingerprints. These encode a set the presence of a set of topological features using a series of bits. One of the limitations with this classification is that it is restricted by the predefined set of structures and features. This method makes use of chemical patterns which advantageously can can classify a chemical feature that cannot be sufficiently described by molecular substructure.

SMARTS (a molecular description language based on SMILES) allow description of structures with varying levels of specificity. For example one can specify atomic element, whether the atom is a subset of elements, whether it is aliphatic or aromatic, or whether it is in a ring. The method makes use of this description of molecules as the group have already developed some software to visualise SMARTS strings and modify them: the SMARTSeditor.

The method involves combining automatic pattern generation and visualisation to form SMARTSminer. Given two distinct molecule sets, the algorithm derives connected chemical patterns to differentiate both sets by using a sub-graph mining technique: solutions are extended by single elements iteratively.

The SMARTSminer was then used to test a series of test cases using the DUD (Database of Useful Decoys) data set. This seems strange when the data set has been shown to be inaccurate and perhaps there are more accurate test sets available, such as DUDe (Database of Useful Decoys enchanced). Let us look a couple of these case sets in more detail.

  1. Discrimination between Active Molecules on Similar Targets

The first case set looks at discriminating between molecules that are active for COX-1 and COX-2. COX proteins are cyclooxygenase that are involved in inflammatory reaction. These proteins are targeted by inhibitors such as aspirin and ibuprofen for the relief of inflammatin and pain. Both COX-1 and COX-2 are similar targets with similar molecular weight and 65% sequence identity. Selective inhibition is only due to a difference in residue at position 523.

Separation of the sets of molecules was possible with a pattern identified that hit 21/25 of the molecules active for COX-1 and 15/348 of molecules of molecules active for COX-2. When the positive and negative set are reversed a pattern is identified that matched 313/348 of COX-2 actives but only 1 of the COX-1 ligands. The group state that perfect separation is not possible as there is an overlap of 2 molecules.

It is interesting that patterns were identified that could discriminate between the two sets. However, there is no discussion of how to use this information. Additionally the pattern determined has not been tested on any molecules outside of the training set – there are no blind tests. This seems strange as a blind test could emphasise the usefulness of this method if it was successful.

2. Discrimination between Active and Inactive Molecules

The second case investigates determining whether a pattern can be generated that discriminates between active and inactive targets. The test case used target SAHH (S-adenosyl-homocysteine hydrolase). A pattern was generated that matched all active molecules and only 1% of inactives. What is particularly exciting is that the pattern found contains part of the interaction network hydrogen bonding partners of the ligand, as shown in the figure below (the pattern identified is highlighted in green).


I find it very surprising that the group did not follow up with blind tests of molecules not used in the training set – especially as the pattern identified a key part of the binding mechanism.

To summarise a new method, SMARTSminer, calculates discriminative patterns between two sets of molecules using the SMARTS language. The authors state that the method has shown applicability in several use cases covering the application of actives vs decoys, kinase classifications, analysis of data sets and characterisation of reaction centers. However, I’m not sure I can agree with that statement. I believe further blind tests would be required to prove the applicability of the method once the pattern has been found. I also believe that an analysis of whether the pattern is over fitted to the training data is also required.