# Improving the precision of local community detection

Communities are everywhere. You can find groups of connected people on Facebook who went to university together, communities of people on YouTube that like the same band, sets of products on Amazon that tend to be purchased together, or modules of proteins that act in the same biological process in the human body. With the amount of data we can collect on these networks (Facebook friendships, protein interactions, etc.), you would think we can easily recover these groups. You would think wrong…

There are a lot of methods out there to find communities in networks. As Lancichinetti and Fortunato put it in a paper comparing community detection methods: “A thorough analysis of all existing methods would require years of work (during which many new methods would appear), and we do not believe it is feasible”. Clearly, then, we have not found the answer yet.

So what do people generally do to find these communities? Well… very different things. To explain a few of the approaches, it helps to break them down into broad categories such as global and local methods.

## Global Methods

Global community detection methods generally aim to optimize an objective function over the entire network. For example, I could decide that I will score the network partition based on a function I will call “greatness”, G. I will define the greatness of a community by the sum over each node of the number of its neighbours that are in the same community (the sum of the node in-degree). Or simply:

G = Σ_{i,j} A_{ij} δ(σ_i, σ_j)

Here, A_ij is the adjacency matrix that contains a 1 where nodes i and j interact, and 0 otherwise. The sigmas in the equation correspond to the community identifiers of nodes i and j respectively, and thus the delta function is only 1 when sigma_i and sigma_j are equal, meaning nodes i and j are in the same community, and 0 otherwise.

If you now come to the conclusion that this is a pretty stupid thing to define, then you’ve understood the concept of a global method (alternatively… get rid of that negativity!). A global community detection method would now look at how to partition the network so as to maximize G. With greatness defined as it is here, everything will be put into the same community, as there’s no penalty for putting two nodes together that shouldn’t belong together. Adding a further node to a community can never decrease the greatness score!
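To make the toy objective concrete, here is a minimal sketch (the function name `greatness` and the example graph are my own illustration) that scores a partition by summing A_ij over all pairs of nodes assigned to the same community:

```python
def greatness(adj, communities):
    """Sum A_ij over all pairs (i, j) placed in the same community."""
    n = len(adj)
    return sum(adj[i][j]
               for i in range(n) for j in range(n)
               if communities[i] == communities[j])

# Path graph 0-1-2-3 as an adjacency matrix.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

print(greatness(A, [0, 0, 1, 1]))  # split into {0,1} and {2,3}: 4
print(greatness(A, [0, 0, 0, 0]))  # everything in one community: 6
```

As expected, lumping all nodes into a single community always scores at least as high as any split, which is exactly the flaw of this objective.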

If you want a real overview over these and other methods, check out this 100-page review article.

## Local Methods

In contrast to global methods, which attempt to gather information on the whole network, local community detection methods dive into the network and optimize the community partitioning piece by piece. On the one hand, this means they can be very fast at assigning nodes to communities, as they take less information into account. On the other hand, using less information in this assignment means they can be less precise.

An example of a local method is link clustering, which I described in an earlier post. Link clustering assesses the similarity of two edges that share a node, rather than comparing the nodes themselves. The similarity of these edges is calculated based on how similar the surroundings of the nodes at their far ends are. Using this score, all edges that are more similar than a cutoff value, S, can be clustered together to define the edge-communities at resolution S. This partitioning translates back to nodes by taking the nodes that the edges in each edge-community connect.
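As a rough sketch of that similarity score (a Jaccard-style measure in the spirit of link clustering; the names and the example graph are my own), two edges (i, k) and (j, k) sharing node k can be compared via the inclusive neighbourhoods of their far ends i and j:

```python
def edge_similarity(adj, i, j):
    """Jaccard similarity of the inclusive neighbourhoods of nodes i and j,
    scoring the pair of edges (i, k) and (j, k) that share node k."""
    ni = adj[i] | {i}  # neighbours of i, plus i itself
    nj = adj[j] | {j}
    return len(ni & nj) / len(ni | nj)

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

print(edge_similarity(adj, 0, 1))  # edges (0,2) and (1,2): identical surroundings, 1.0
print(edge_similarity(adj, 0, 3))  # edges (0,2) and (3,2): dissimilar, 0.25
```

Thresholding this score at a cutoff S then groups similar edges into edge-communities.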

## Improving local community detection

Link clustering is a pretty nice and straightforward example of how you can successfully find communities. However, the authors admit it works better on dense than on sparse networks. This is an expected side-effect of taking less information into account in exchange for speed, as mentioned above. So you can improve it with a pretty simple idea: how much more information can I take into account without essentially looking at the whole network?

Local community detection methods such as link clustering are generally extended by including more information. One approach is to include information other than network topology in the local clustering process (like this). This approach requires information to be present apart from the network. In my work, I like using orthogonal data sets to validate community detection, rather than integrating them into the process and then having no independent way to check whether it actually worked. So instead, I’m trying to include more of the local surroundings of each node to increase the sensitivity of the similarity measure between edges. At some point, this will blow up the complexity of the problem and turn it into a global method. So I’m now walking a line between precision and speed. A line that everyone always seems to be walking. On this line, you can’t really win… but maybe it is possible to optimize the trade-off to suit my purposes. And besides… longer algorithm run times give me more time for lunch breaks ;).

# Conformational Variation of Protein Loops

Something many structural biologists (including us here in OPIG!) are guilty of is treating proteins as static, rigid structures. In reality, proteins are dynamic molecules, transitioning constantly from conformation to conformation. Proteins are therefore more accurately represented by an ensemble of structures instead of just one.

In my research, I focus on loop structures, the regions of a protein that connect elements of secondary structure (α-helices and β-sheets). There are many examples in the PDB of proteins with identical sequences, but whose loops have different structures. In many cases, a protein’s function depends on the ability of its loops to adopt different conformations. For example, triosephosphate isomerase, which is an important enzyme in the glycolysis pathway, changes conformation upon ligand binding, shielding the active site from solvent and stabilising the intermediate compound so that catalysis can occur efficiently. Conformational variability helps triosephosphate isomerase to be what is known as a ‘perfect enzyme’; catalysis is limited only by the diffusion rate of the substrate.

Structure of the triosephosphate isomerase enzyme. When the substrate binds, a loop changes from an ‘open’ conformation (pink, PDB entry 1TPD) to a ‘closed’ one (green, 1TRD), which prevents solvent access to the active site and stabilises the intermediate compound of the reaction.

An interesting example, especially for some of us here at OPIG, is the antibody SPE7. SPE7 is multispecific, meaning it is able to bind to multiple unrelated antigens. It achieves this through conformational diversity. Four binding site conformations have been found, two of which can be observed in its unbound state in equilibrium – one with a flat binding site, and another with a deep, narrow binding site [1].

SPE7; an antibody that exists as two different structures in equilibrium – one with a shallow binding site (left, blue, PDB code 1OAQ) and one with a deep, narrow cleft (right, green, PDB 1OCW). Complementarity-determining regions are coloured in each case.

So when you’re dealing with crystal structures, beware! X-ray structures are averages – each atom position is an average of its position across all unit cells. In addition, thanks to factors such as crystal packing, the conformation that we see may not be representative of the protein in solution. The examples above demonstrate that the sequence/structure relationship is not as clear-cut as we are often led to believe. It is important to consider dynamics and conformational diversity, which may be required for protein function. Always bear in mind that the static appearance of an X-ray structure is not the reality!

[1] James, L C, Roversi, P and Tawfik, D S. Antibody Multispecificity Mediated by Conformational Diversity. Science (2003), 299, 1362-1367.

# “Identifying Allosteric Hotspots with Dynamics”: Using molecular simulation to find critical sites for the functional motions in proteins

Allosteric (allo-(“other”) + steric (repulsion of atoms due to closeness or arrangement)) sites regulate protein function from a position other than the active site or binding site. Consider the latch on a pair of gardening scissors (Figure 1): depending on the position of the latch (allosteric site) the blades are prevented from cutting things at the other end (active site).

Figure 1 Allostery explained: A safety latch in gardening scissors.

Due to the non-trivial positions of allosteric sites in proteins, their identification has been challenging. Select well-characterised systems such as GPCRs have known allosteric sites that are being used as targets in drug development. However, large-scale identification of allosteric sites across the Protein Data Bank (PDB) has not yet been feasible, partly because of the lack of tools.

To tackle this problem the Gerstein Lab developed a computational protocol based on various molecular simulations and network methods to find allosteric hotspots in proteins across the PDB. They introduce two different pipelines; one for identifying allosteric residues on the surface (surface-critical) and one for buried residues (interior-critical).

To find surface-critical residues, they repeatedly probe the surface of the protein with a short peptide in Monte Carlo (MC) simulations. Based on hard spheres and simple energy calculations, this method seems to be an efficient way of detecting possible binding pockets. Once the binding pockets have been found, the collective motions of the structure are simulated using an elastic mass-and-spring network (an anisotropic network model [ANM]). Binding pockets that undergo significant deformation during these simulations are considered to be surface-critical.

For interior-critical residues they start by weighting residue-residue contacts on the basis of collective movement. Communities within the weighted network are then identified and the residues with the highest betweenness interactions between communities are chosen as interior-critical residues. Thus, interior-critical residues have the highest information flow between two densely inter-connected groups of residues.
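To illustrate the betweenness idea (this is a generic Brandes-style sketch on a toy graph, not the STRESS code, and the residue network here is invented), nodes that carry the most shortest paths between two densely connected groups stand out clearly:

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for (unnormalised) betweenness centrality
    on an unweighted graph given as {node: [neighbours]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, pred = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1   # shortest-path counts
        dist = {v: -1 for v in adj}; dist[s] = 0
        q = deque([s])
        while q:                                    # BFS from s
            v = q.popleft(); stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; pred[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                                # back-propagate dependencies
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Two triangles ("communities") joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
bc = betweenness(adj)
# The bridging nodes 2 and 3 get the highest scores.
```

In the STRESS protocol the analogous computation runs on a network of residue-residue contacts weighted by correlated motion, and the high-betweenness residues between communities are the interior-critical candidates.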

The protocol has been implemented in STRESS (STRucturally identified ESSential residues) and is freely available at stress.molmovdb.org.

Publication: http://www.ncbi.nlm.nih.gov/pubmed/27066750

# Comp Chem Kitchen

I recently started “Comp Chem Kitchen” with Richard Cooper and Rob Paton in the Department of Chemistry here in Oxford. It’s a regular forum and seminar series for molecular geeks and hackers, in the original, untarnished sense of the word: people using and developing computational methods to tackle problems in chemistry, biochemistry and drug discovery. Our hope is that we will share best practices, even code snippets and software tools, and avoid re-inventing wheels.

In addition to local researchers, we invite speakers from industry and non-profits from time to time, and occasionally organize software demos and tutorials.

We also provide refreshments including free beer. (We are grateful to Prof. Phil Biggin and the MRC Proximity to Discovery Fund for offering to support CCK.)

## CCK-1

Our first meeting, CCK-1, was held in the Abbot’s Kitchen on May 24, 2016, at 5 pm, and was a great success—standing room only, in fact! The Abbot’s Kitchen—originally a laboratory—is a beautiful stone building built in 1860 in the Victorian Gothic style, alongside the Natural History Museum, at a time when Chemistry was first recognized as a discipline.

We heard a fascinating talk from Jerome Wicker from the Department of Chemistry, who spoke about “Machine learning for classification of solid form data extracted from CSD and ZINC” and described a method that could successfully discriminate (~80%) whether a small molecule would crystallize or not. The software tools he discussed included RDKit, the CSD, and scikit-learn. There were also two lightning talks, each 5 minutes long. The first was from OPIG member Hannah Patel, from the Department of Statistics, on “Novelty Score: Prioritising compounds that potentially form novel protein-ligand interactions and novel scaffolds using an interaction centric approach”; she briefly described her Django-based web interface to her RDKit-based tool, which analyses structures of ligands bound to proteins and helps guide future medicinal chemistry towards novel compounds. The second was from Dr Michael Charlton from InhibOx, who spoke about “Antibacterial Drug Discovery and Machine Learning”.

## CCK-2

Our next Comp Chem Kitchen, CCK-2, will be held next Tuesday (June 14th, 2016), and you can register free for CCK-2 here.

We will have talks from:

Hope to see you there!  (Did I say free beer?)

# RDKit User Group Meeting 2016

If you’re interested in small molecules and cheminformatics, you might like to know that registration for the 2016 RDKit User Group Meeting is now open. The meeting will be held from the 26th-28th of October in Basel, Switzerland. From their announcement:

This year’s RDKit User Group Meeting will take place from 26-28 October at the Novartis Campus in Basel, Switzerland and is co-organized by people from Roche and Novartis.

Registration for the RDKit UGM is free:

https://www.eventbrite.com/e/5th-rdkit-user-group-meeting-tickets-22539677783

The previous years’ format seemed to work pretty well and the feedback was positive, so we will stick to the same format this year:

Days 1 and 2: Talks, lightning talks, roundtable(s), discussion, and talktorials. For those who haven’t attended before, talktorials are somewhere between a talk and a tutorial: they cover something interesting done with the RDKit and include the code used to do the work. During the presentation you’ll give an overview of what you did and also show the pieces of the code that are central to the work. The idea is to mix the science up with the tutorial aspects.

Day 3 will be a sprint: those who choose to stay will spend an intense day working in small groups to produce useful artifacts: new bits of code, KNIME nodes, KNIME workflows, tutorials, documentation, IPython notebooks, etc. We will once again try to structure this a bit by collecting a bunch of ideas for things to work on in advance.

There will also be, of course, social activities.

Looks like fun!

# Advances in Conformer Generation: ETKDG and ETDG

Predicting the possible shapes a small molecule can adopt is essential to understanding its chemistry and the possible biological roles of candidate drugs. Thus, conformer generation, the process of converting a topological description of a molecule into a set of 3D positions of its constituent atoms, is an essential component of computational drug discovery.

A former member of OPIG, Dr Jean-Paul Ebejer, myself, and Prof. Charlotte Deane, compared the ability of freely-available conformer generation methods:

“(i) to identify which tools most accurately reproduce experimentally determined structures;
(ii) to examine the diversity of the generated conformational set; and
(iii) to benchmark the computational time expended.”

in Ebejer et al., 2012. JP assembled a set of 708 crystal structures of drug-like molecules from the OMEGA validation set and the Astex Diverse Set with which to test the various methods. We found that RDKit, combining its Distance Geometry (DG) algorithm with energy minimization using the MMFF94 force field, proved to be a “valid free alternative to commercial, proprietary software”, and was able to generate “a diverse and representative set of conformers which also contains a close conformer to the known structure”.

Following on from our work at InhibOx, and building on the same benchmark set JP assembled, Greg Landrum and Sereina Riniker recently described (Riniker & Landrum, 2015) two new conformer generation methods, ETDG and ETKDG, that improve upon the classical distance geometry (DG) algorithm. They do this by combining DG with knowledge of preferred torsional angles derived from experimentally determined crystal structures (ETDG), and by further adding constraints from chemical knowledge, such as ‘aromatic rings are flat’ or ‘bonds connected to triple bonds are colinear’ (ETKDG). They compared DG, ETDG, ETKDG, and a knowledge-based method, CONFECT, and found:
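For anyone wanting to try this, ETKDG is exposed in recent RDKit releases. A minimal usage sketch (ethanol is just a stand-in molecule; any SMILES would do):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Build a molecule and add hydrogens (recommended before embedding).
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))

# Embed one conformer using the experimental-torsion, knowledge-based
# distance geometry parameters (ETKDG).
conf_id = AllChem.EmbedMolecule(mol, AllChem.ETKDG())
# conf_id is the new conformer's ID (0 here) on success, -1 on failure.
```

Multiple conformers can be generated analogously with `AllChem.EmbedMultipleConfs`.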

“ETKDG was found to outperform standard DG and the knowledge-based conformer generator CONFECT in reproducing crystal conformations from both small-molecule crystals (CSD data set) and protein−ligand complexes (PDB data set). With ETKDG, 84% of a set of 1290 small-molecule crystal structures from the CSD could be reproduced within an RMSD of 1.0 Å and 38% within an RMSD of 0.5 Å. The experimental torsional-angle preferences or the K terms alone each performed better than standard DG but were not sufficient to obtain the full performance of ETKDG.

Comparison of ETKDG with the DG conformers optimized using either the Universal Force Field (UFF) or the Merck Molecular Force Field (MMFF) showed different results for the two data sets. While FF-optimized DG performed better on the CSD data set, the two approaches were comparable for the PDB data set.”

They also showed (Fig. 13) that their ETKDG method was faster than DG followed by energy minimization, but not quite as accurate in reproducing the crystal structure.

ETKDG takes 3 times as long as DG. The addition of the K terms, i.e., generating ETKDG embeddings instead of ETDG embeddings, increases runtime by only 10% over ETDG (results not shown). Despite the longer runtime per conformer, ETKDG requires on average one-quarter of the number of conformers to achieve performance similar to DG (Figure S12 and Table S4 in the Supporting Information). This results in a net performance improvement, at least when it comes to reproducing crystal conformers.

As measured by performance in reproducing experimental crystal structures, ETKDG is a viable alternative to plain DG followed by a UFF-optimization, so it is of interest how their runtimes compare. Figure 13 (right) plots the runtime for ETKDG versus the runtime for DG + UFF-optimization. The median ratio of the DG + UFF optimization and ETKDG runtimes is 1.97, i.e., DG + UFF optimization takes almost twice as long as ETKDG. Thus, although ETKDG is significantly slower than DG on a per-conformer basis, when higher-quality conformations are required it can provide structures that are the equivalent of those obtained using DG + UFF-optimization in about half the time.

ETKDG looks like a great addition to the RDKit toolbox for conformer generation (and it was great to see JP thanked in the Acknowledgments!).

References

Ebejer, J. P., G. M. Morris and C. M. Deane (2012). “Freely available conformer generation methods: how good are they?” J Chem Inf Model, 52(5): 1146-1158. 10.1021/ci2004658.

Riniker, S. and G. A. Landrum (2015). “Better Informed Distance Geometry: Using What We Know To Improve Conformation Generation.” J Chem Inf Model, 55(12): 2562-2574. 10.1021/acs.jcim.5b00654.