Community structure in multilayer networks

Multilayer networks are a generalisation of network that may incorporate different types of interactions [1]. This could be different time points in temporal data, measurements in different individuals or under different experimental conditions. Currently many measures and methods from monolayer networks are extended to be applicabile to multilayer networks. Those include measures of centrality [2], or methods that enable to find mesoscale structure in networks [3,4].

Examples of such mesoscale structure detection methods are stochastic block models and community detection. Both try to find groups of nodes that behave structurally similar in a network. In its most simplistic way you might think of two groups that are densely connected internally but only sparsely between the groups. For example two classes in a high school, there are many friendships in each class but only a small number between the classes. Often we are interested in how such patterns evolve with time. Here, the usage of multilayer community detection methods is fruitful.

From [4]: Multislice community detection of U.S. Senate roll call vote similarities across time. Colors indicate assignments to nine communities of the 1884 unique senators (sorted vertically and connected across Congresses by dashed lines) in each Congress in which they appear. The dark blue and red communities correspond closely to the modern Democratic and Republican parties, respectively. Horizontal bars indicate the historical period of each community, with accompanying text enumerating nominal party affiliations of the single-slice nodes (each representing a senator in a Congress): PA, pro-administration; AA, anti-administration; F, Federalist; DR, Democratic-Republican; W, Whig; AJ, anti-Jackson; A, Adams; J, Jackson; D, Democratic; R, Republican. Vertical gray bars indicate Congresses in which three communities appeared simultaneously.

Mucha et al. analysed the voting pattern in the U.S. Senate [4]. They find that the communities are oriented as the political party organisation. However, the restructuring of the political landscape over time is observable in the multilayered community structure. For example, the 37th Congress during the beginning of the American Civil War brought a major change in the voting patterns. Modern politics is dominated by a strong partition into Democrats and Republicans with third minor group that can be identified as the ‘Southern Democrats’ that had distinguishable voting patterns during the 1960.

Such multilayer community detection methods can be insightful for networks from other disciplines. For example they have been adopted to describe the reconfiguration in the human brain during learning [5]. Hopefully they will be able to give us insight in the structure and function of protein interaction.

[1] De Domenico, Manlio; Solé-Ribalta, Albert; Cozzo, Emanuele; Kivelä, Mikko; Moreno, Yamir; Porter, Mason A.; Gómez, Sergio; and Arenas, Alex [2013]. Mathematical Formulation of Multilayer Networks, Physical Review X, Vol. 3, No. 4: 041022.

[2] Taylor, Dane; Myers, Sean A.; Clauset, Aaron; Porter, Mason A.; and Mucha, Peter J. [2016]. Eigenvector-based Centrality Measures for Temporal Networks

[3] Tiago P. Peixoto; Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. Phys. Rev. E 92, 042807

[4] Mucha, Peter J.; Richardson, Thomas; Macon, Kevin; Porter, Mason A.; and Onnela, Jukka-Pekka [2010]. Community Structure in Time-Dependent, Multiscale, and Multiplex Networks, Science, Vol. 328, No. 5980: 876-878.

[5] Bassett, Danielle S.; Wymbs, Nicholas F.; Porter, Mason A.; Mucha, Peter J.; Carlson, Jean M.; and Grafton, Scott T. [2011]. Dynamic Reconfiguration of Human Brain Networks During Learning, Proceedings of the National Academy of Sciences of the United States of America, Vol. 118, No. 18: 7641-7646.

Drawing Custom Unrooted Trees from Sequence Alignments

Multiple Sequence Alignments can provide a lot of information relating to the relationships between proteins. One notable example was the map of the kinome space published in 2002 (Figure 1).

Figure 1. Kinase space as presented by Manning et al. 2002;

Such images organize our thinking about the possible space of such proteins/genes going beyond long lists of multiple sequence alignments. The image in Figure 1, got a revamp later which now is the popular ‘kinome poster’ (Figure 2).

Revamped dendrogram of the kinome fro Fig. 1. Downloaded from http://i.imgur.com/BPLUvfc.png.

Here we have created a script to produce similar dendrograms straight from the multiple sequence alignment files (although clearly not as pretty as Fig 2!). It is not difficult to find software that would produce ‘a dendrogram’ from an MSA but making it do the simple thing of annotating the nodes with colors, shapes etc. with respect to the labels of the genes/sequences is slightly more problematic. Sizes might correspond to the importance of given nodes and colors can organize by their tree branches. The script uses the Biopython module Phylo to construct a tree from an arbitrary MSA and networkx to draw it:

python Treebeard.py
import networkx, pylab
from networkx.drawing.nx_agraph import graphviz_layout
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO

#What color to give to the edges?
e_color = '#ccccff'
#What colors to give to the nodes with similar labels?
color_scheme = {'RSK':'#e60000','SGK':'#ffff00','PKC':'#32cd32','DMPK':'#e600e6','NDR':'#3366ff','GRK':'#8080ff','PKA':'magenta','MAST':'green','YANK':'pink'}
#What sizes to give to the nodes with similar labels?
size_scheme = {'RSK':200,'SGK':150,'PKC':350,'DMPK':400,'NDR':280,'GRK':370,'PKA':325,'MAST':40,'YANK':200}

#Edit this to produce a custom label to color mapping
def label_colors(label):
	color_to_set = 'blue'
	for label_subname in color_scheme:
		if label_subname in label:
			color_to_set = color_scheme[label_subname]
	return color_to_set

#Edit this to produce a custom label to size mapping
def label_sizes(label):
	#Default size
	size_to_set = 20
	for label_subname in size_scheme:
		if label_subname in label:
			size_to_set = size_scheme[label_subname]
	return size_to_set

#Draw a tree whose alignment is stored in msa.phy
def draw_tree():
	
	#This loads the default kinase alignment that should be in the same directory as this script
	aln = AlignIO.read('agc.aln', 'clustal')
	#This will construct the unrooted tree.
	calculator = DistanceCalculator('identity')
	dm = calculator.get_distance(aln)
	constructor = DistanceTreeConstructor()
	tree = constructor.nj(dm)
	G = Phylo.to_networkx(tree)
	node_sizes = []
	labels = {}
	node_colors = []
	for n in G:
		label = str(n)
		if 'Inner' in label:
			#These are the inner tree nodes -- leave them blank and with very small sizes.
			node_sizes.append( 1 )
			labels[n] = ''
			node_colors.append(e_color)
		else:
			#Size of the node depends on the labels!
			node_sizes.append( label_sizes(label) )
			#Set colors depending on our color scheme and label names
			node_colors.append(label_colors(label))
			#set the label that will appear in each node			
			labels[n] = label
	#Draw the tree given the info we provided!
	pos = graphviz_layout(G)
	networkx.draw(G, pos,edge_color=e_color,node_size = node_sizes, labels=labels, with_labels=True,node_color=node_colors)
	#Showing	
	pylab.show()
	#Saving the image -- uncomment
	#pylab.savefig('example.png')

if __name__ == '__main__':
	
	draw_tree()

We are going to use the kinase alignment example to demonstrate how the script can be used. The kinase alignment we use can be found here on the kinase.com website. We load the alignment and construct the unrooted tree using the Bio.Phylo module. Note that on each line of the alignment there is a name. These names are the labels that we use to define the colors and sizes of nodes. There are two dummy functions that achieve that label_nodes() and label_sizes() — if you look at them it should be clear how to define your own custom labeling.

If you download the code and the alignment and run it by:

python Treebeard.py

You should see a similar image as in Fig 3.

Fig 3. Size-color-customized unrooted tree straight from a multiple sequence alignment file of protein kinases. Constructed using the script Treebeard.py

Inserting functional proteins in an antibody

At the group meeting on the 3rd of February I presented the results of the paper “A General Method for Insertion of Functional Proteins within Proteins via Combinatorial Selection of Permissive Junctions” by Peng et. al. This is interesting to our group, and especially to me, because this is a novel way of designing an antibody, although I suspect that the scope of their research is much more general, their use of antibodies being a proof of concept.

Their premise is that the structure of a protein is essentially secondary structures and tertiary structure interconnected through junctions. As such it should be possible to interconnect regions from different proteins through junctions, and these regions should take up their native secondary and tertiary structures, thus preserving their functionality. The question is what is a suitable junction? This is important because these junctions should be flexible enough to allow the proper folding of the different regions, but also not too flexible as to have a negative impact on stability. There has been previous work done on trying to design suitable junctions, however the workflow presented in this paper is based on trying a vast number of junctions and then identifying which of them work.

As I said above their proof concept is antibodies. They used an antibody scaffold (the host), out of which they removed the H3 loop and then fused to it, using junctions, two different proteins: Leptin and FSH (the guests). To identify the correct junctions they generated a library of antibodies with random three residues sequences on either side of the inserted protein plus a generic linker (GGGGS) that can be repeated up to three times.

They say that the theoretical size of the library is 10^9 (however I would say it is 9*20^6), and the actually achieved diversity of their library was of size 2.88*10^7 for Leptin and 1.09*10^7. Next step is to identify which junctions have allowed the guest protein to fold properly. For this they devised an autocrine-based selection method using engineered cells that have beta lactamase receptors which have either Leptin or FSH as agonists. A fluoroprobe in the cell responds to the presence of beta lactamase producing a blue color, instead of green and therefore this allows the cells with the active antibody-guest designed protein (clone) to be identified using FRET-based fluorescence-activated cell sorting.

They managed to identify 6 clones that worked for Leptin and 3 that worked for FSH with the linkers being listed in the below table:

There does not seem to be a pattern emerging from those linker sequences, although one of them repeats itself. For my research it would have been interesting if a pattern did emerge, and then that could be used as a generic linker for future designers. However, this is still another prime example of how well an antibody scaffold can be used a starting point for protein engineering.

As a bonus they also tested in vivo how their designs work and they discovered that the antibody-leptin design (IgG-Leptin) has a longer lifetime. This is probably due to the fact that being a larger protein this is not filtered out by the kidneys.

Identifying basic building blocks/motifs of networks

The optimal subgraph decomposition of an electronic circuit.

There are many verbal descriptions for network motifs: characteristic connectivity patterns, over represented subgraphs, recurrent circuits, basic building-blocks of networks just to name a few. However, as with most concepts in network science network motifs are maybe best explained in terms of empirical observations. For instance the most basic example of a network motif is the motif consisting of tree mutually connected nodes that is: a triangle. Many real world networks ranging from the internet to social networks to biological networks contain many more triangles than one would expect if they were wired randomly. In certain cases there exist good explanations for the large number of triangles found in the network. For instance, the presence of many triangles in friendship networks simply tell us that we are more likely to be friends with the friends of our friends. In biological networks triangles and other motifs are believed to contribute to the overall function of the network by performing modular tasks such as information processing and therefore are believed to be favoured by natural selection.

The predominant definition of network motifs is due to Milo et al. [1] and defines network motifs on the basis of how surprising their frequency in the network is when compared to randomized version of the network. The randomized version of the network is usually taken to be the configuration model i.e. the ensemble of all networks that have the same degree distribution as the original network. Following this definition motifs are identified by comparing their counts in the original network with a large sample of this null model. The approach of Milo et al. formalizes the concept of network motifs as over represented connectivity patterns. However, the results of the method are highly dependent on the choice of null model.

In my talk I presented an alternative approach to motif analysis [2] that seeks to formalize the network motifs from the perspective of simple building blocks. The approach is based on finding an optimal decomposition of the network into subgraphs. Here, subgraph decompositions are defined as subgraph covers which are sets of subgraphs such that every edge of the network is contained in at least one of the subgraphs in the cover. It follows from this definition that a subgraph cover is a representation of the network in the sense that given a subgraph cover the network can be recovered fully by simply taking the union of the edge sets of the subgraphs in the cover. In fact many network representations including edge lists, adjacency lists, bipartite representations and power-graphs fall into the category of subgraph covers. For instance, the edge list representation is equivalent to the cover consisting of all single edge subgraphs of the network and bipartite representations are simply covers which consist of cliques of various sizes.

Given that there are many competing ways of representing a network as a subgraph cover the question of how one picks one of the covers over the other arises. In order to address this problem we consider the total information of subgraph covers as a measure of optimality. The total information is a information measure introduced by Gell-Mann and Hartle [3] which given a model for a certain entity e is defined to be sum of the entropy and effective complexity of M. While the entropy measures the information required to describe e given M and the effective complexity measures the amount of information required to specify M which is given by its algorithmic information content. The total information also provides a framework for model selection: given two or more models for the same entity one picks the one that has lowest total information and if two models have the same total information one picks the one that has lower effective complexity i.e. the simpler one. This essentially tells us how to trade off goodness of fit and model complexity.

In the context of subgraph covers the entropy of a cover corresponds to the information required to give the positions of the subgraphs in the cover given the different motifs that occur in C and their respective frequencies in C. On the other hand the effective complexity of C corresponds to the information required to describe the set of motifs occurring in the cover together with their respective frequencies. While the entropy of subgraph covers can be calculated analytically their effective complexity is not computable due to the halting problem. However, in practice one can use approximations in the form of upper bounds.

Following the total information approach we now define an optimal subgraph cover of network G to be a subgraph cover that minimizes the total information and the network motifs of G to be the motifs/connectivity patterns that occur in such an optimal cover.
The problem of finding an optimal cover turns out to be computationally rather challenging. Besides the usual difficulties associated to counting subgraphs (subgraph isomorphism problem-NP complete) and classifying subgraphs (graph isomorphism problem-complexity unknown) the problem is a non-linear set covering problem and therefore NP-hard. Consequently, we construct a greedy heuristic for the problem.

When applied to real world networks the method finds very similar motifs in networks representing similar systems. Moreover, the counts of the motifs in networks of the same type scale approximately with network size. Consequently, the method can also be used to classify networks according to their subgraph structure.

References:

[1] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Network Motifs: Simple Building Blocks of Complex Networks, Science 298, 824 (2002)

[2] Wegner, A. E. Subgraph covers: An information-theoretic approach to motif analysis in
networks. Phys. Rev. X, 4:041026, Nov 2014

[3] M. Gell-Mann and S. Lloyd, Information Measures, Effective Complexity, and Total Information, Complexity 2, 44 (1996).

Novelty in Drug Discovery

The primary aim of drug discovery is to find novel molecules that are active against a target of therapeutic relevance and that are not covered by any existing patents (1). Due to the increasing cost of research and development in the later stages of drug discovery, and the increase in drug candidates failing at these stages, there is a desire to select the most diverse set of active molecules at the earliest stage of drug discovery, to maximise the chance of finding a molecule that can be optimised into a successful drug (2,3). Computational methods that are both accurate and efficient are one approach to this problem and can augment experiment approaches in deciding which molecules to take forward.

But what do we mean by a “novel” compound? When prioritising molecules for synthesis which characteristics do we want to be different? It was once common to select subsets of hits to maximise chemical diversity in order to cover as much chemical space as possible (4). These novel lead molecules could subsequently be optimised, the idea that maximising the coverage of chemical space would maximise the chance of finding a molecule that could be optimised successfully. More recently however, the focus has shifted to “biodiversity”: diversity in terms of how the molecule interacts with the protein (1). Activity cliffs, pairs of molecules that are structurally and chemically similar but have a large difference in potency, indicate that chemical diversity may not be the best descriptor to identify molecules that interact with the target in sufficiently diverse ways. The molecules to be taken forward should be both active against the target and diverse in terms of how they interact with the target, and the parts of the binding site the molecule interacts with.

This raises two interesting ideas. The first is prioritising molecules that form the same interactions as molecules known to bind but are chemically different: scaffold hopping (5). The second is prioritising molecules that potentially form different interactions to known binders. I hope to explore this in the coming months as part of my research.

References

(1) J. K. Medina-Franco et al., Expert Opin. Drug Discov., 2014, 9, 151-156.

(2) A. S. A. Roy, Project FDA Report, 2012, 5.

(3) J. Avorn, New England Journ. of Med., 2015, 372, 1877-1879.

(4) P. Willet, Journ. Comp. Bio., 1999, 6, 447-457.

(5) H. Zhao, Drug Discov. Today, 2007, 12, 149–155.

Designing antibodies targeting disordered epitopes

At the meeting on February 10 I covered the article by Sormanni et al. describing a methodology for computationally designing antibodies against intrinsically disordered regions of proteins.

Antibodies are proteins that are a natural part of our immune system. For over 50 years lab-made antibodies have been used in a wide variety of therapeutic and diagnostic applications. Nowadays, we can design antibodies with high specificity and affinity for almost any target. Nevertheless, engineering antibodies against intrinsically disordered proteins remains costly and unreliable. Since as many as about 33.0% of all eukaryotic proteins could be intrinsically disordered, and the disordered proteins are often implicated in various ailments and diseases such methodology could prove invaluable.

Cascade design

The initial step in the protocol involves searching the PDB for protein sequences that interact in a beta strand with segments of the target sequence. Next, such peptides are joined together using a so-called “cascade method”. The cascade method starts with the longest found peptide and grows it to the length of the target sequence by joining it with other, partially overlapping peptides coming from beta strands of the same type (parallel, antiparallel). In the cascade method, all fragments used must form the same hydrogen bond pattern. The resulting complementary peptide is expected to “freeze” part of the discorded protein by forcing it to locally form a beta sheet. After the complementary peptide is designed, it is grafted on a single-domain antibody scaffold. This decision has been made as antibodies have a longer half-life and lower immunogenicity.

To test their method the authors initially assessed the robustness of their design protocol. First, they run the cascade method on three targets – a-synuclein, Aβ42 and IAPP. They found that more than 95% of the residue position in the three proteins could be targeted by their method. In addition, the mean number of available fragments per position was 570. They also estimated their coverage on a larger scale, using 1690 disordered protein sequences obtained from DisProt database and from measured NMR chemical shifts. About 90% of residue positions from DisProt and 85% positions from the chemical shift could be covered by at least one designed peptide. The positions that were hard to target usually contained Proline, in agreement with the known result that Prolines tend to disrupt secondary structure formation.

To test the quality of their designs the authors created complementary peptides for a-synuclein, Aβ42 and IAPP and grafted them on the CDR3 region of a human single domain antibody scaffold. All designs were highly stable and bound their targets with high specificity. Following the encouraging result the authors measured the affinity of one of their designs (one of the anti-a-synuclein antibodies). The K_d was found to lie in the range 11-27 μM. Such affinity is too low for pharmaceutical purposes, but it is enough to prevent aggregation of the target protein.

As the last step in the project the authors attempted a two-peptide design, where a second peptide was grafted in the CDR2 region of the single-domain scaffold. Both peptides were designed to bind the same epitope. The two peptide design managed to reach the affinity required for pharmaceutical viability (affinity smaller than 185 nM with 95% confidence). Nevertheless, the two loop design became very unstable rendering it not viable for pharmaceutical purposes.

Overall, this study presents a very exciting step towards computationally designed antibodies targeting disordered epitopes and deepens out understanding of antibody functionality.

New toys for OPIG

OPIG recently acquired 55 additional computers all of the same make and model; they are of a decent specification (for 2015), each with quad-core i5 processor and 8GB of RAM, but what to do with them? Cluster computing time!

Along with a couple of support servers, this provides us with 228 computation cores, 440GB of RAM and >40TB of storage. Whilst this would be a tremendous specification for a single computer, parallel computing on a cluster is a significantly different beast.

This kind of architecture and parallelism really lends itself to certain classes of problems, especially those that have:

Independent data
Parameter sweeps
Multiple runs with different random seeds
Dirty great data sets
Or can be snipped up and require low inter-processor communication

With a single processor and a single core, a computer looks like this:

These days, when multiple processor cores are integrated onto a single die, the cores are normally independent but share a last-level cache and both can access the same memory. This gives a layout similar to the following:

Add more cores or more processors to a single computer and you start to tessellate the above. Each pair of cores have access to their own shared cache, they have access to their own memory and they can access the memory attached to any other cores. However, accessing memory physically attached to other cores comes at the cost of increased latency.

Cluster computing on the other hand rarely exhibits this flat memory architecture, as no node can directly another node’s memory. Instead we use a Message Passing Interface (MPI) to pass messages between nodes. Though it takes a little time to wrap your head around working this way, effectively every processor simultaneously runs the exact same piece of code, the sole difference being the “Rank” of the execution core. A simple example of MPI is getting every code to greet us with the traditional “Hello World” and tell us its rank. A single execution with mpirun simultaneously executes the code on multiple cores:

$mpirun -n 4 ./helloworld_mpi Hello, world, from processor 3 of 4 Hello, world, from processor 1 of 4 Hello, world, from processor 2 of 4 Hello, world, from processor 0 of 4

Note that the responses aren’t in order, some cores may have been busy (for example handling the operating system) so couldn’t run their code immediately. Another simple example of this would be a sort. We could for example tell every processor to take several million values, find the smallest value and pass a message to whichever core has “Rank 0” that number. The core at Rank 0 will then sort that much smaller number set of values. Below is the kind of speedup which was achieved by simply splitting the same problem over 4 physically independent computers of the cluster.

As not everyone in the group will have the time or inclination to MPI-ify their code, there is also HTCondor. HTCondor, is a workload management system for compute intensive jobs which allows jobs to be queued, scheduled, assigned priorities and distributed from a single head node to processing nodes, with the results copied back on demand. The server OPIG provides the job distribution system, whilst SkyOctopus provides shared storage on every computation node. Should the required package currently not be available on all of the computation nodes, SkyOctopus can reach down and remotely modify the software installations on all of the lesser computation nodes.

Loop Model Selection

As I have talked about in previous blog posts (here and here, if interested!), the majority of my research so far has focussed on improving our ability to generate loop decoys, with a particular focus on the H3 loop of antibodies. The loop modelling software that I have been developing, Sphinx, is a hybrid of two other methods – FREAD, a knowledge-based method, and our own ab initio method. By using this hybrid approach we are able to produce a decoy set that is enriched with near-native structures. However, while the ability to produce accurate loop conformations is a major advantage, it is by no means the full story – how do we know which of our candidate loop models to choose?

In order to choose which model is the best, a method is required that scores each decoy, thereby producing a ranked list with the conformation predicted to be best at the top. There are two main approaches to this problem – physics-based force fields and statistical potentials.

Force fields are functions used to calculate the potential energy of a structure. They normally include terms for bonded interactions, such as bond lengths, bond angles and dihedral angles; and non-bonded interactions, such as electrostatics and van der Waal’s forces. In principle, they can be very accurate, however they have certain drawbacks. Since some terms have a very steep dependency on interatomic distance (in particular the non-bonded terms), very slight conformational differences can have a huge effect on the score. A loop conformation that is very close to the native could therefore be ranked poorly. In addition, solvation terms have to be used – this is especially important in loop modelling applications since loop regions are generally found on the surface of proteins, where they are exposed to solvent molecules.

The alternatives to physics-based force fields are statistical potentials. In this case, a score is achieved by comparing the model structure (i.e. its interatomic distances and contacts) to experimentally-derived structures. As a very simple example, if the distance between the backbone N and Cα of a residue in a loop model is 2Å, but this distance has not been observed in known structures, we can assume that a distance of 2Å is energetically unfavourable, and so we can tell that this model is unlikely to be close to the native structure. Advantages of statistical potentials over force fields are their relative ‘smoothness’ (i.e. small variations in structure do not affect the score as much), and the fact that all interactions do not need to be completely understood – if examples of these interactions have been observed before, they will automatically be taken into account.

I have tested several statistical potentials (including calRW, DFIRE, DOPE and SoapLoop) by using them to rank the loop decoys generated by our hybrid method, Sphinx. Unfortunately, none of them were consistently able to choose the best decoy out of the set. The average RMSD (across 70 general loop targets) of the top-ranked decoy ranged between 2.7Å and 4.74Å for the different methods – the average RMSD of the actual best decoy was much lower at 1.32Å. Other researchers have also found loop ranking challenging – for example, in the latest Antibody Modelling Assessment (AMA-II), ranking was seen as an area for significant improvement. In fact, model selection is seen as such an issue that protein structure prediction competitions like AMA-II and CASP allow the participants to submit more than one model. Loop model selection is therefore an unsolved problem, which must be investigated further to enable reliable predictions to be made.

Network Pharmacology

The dominant paradigm in drug discovery has been one of finding small molecules (or more recently, biologics) that bind selectively to one target of therapeutic interest. This reductionist approach conveniently ignores the fact that many drugs do, in fact, bind to multiple targets. Indeed, systems biology is uncovering an unsettling picture for comfortable reductionists: the so-called ‘magic bullet’ of Paul Ehrlich, a single compound that binds to a single target, may be less effective than a compound with multiple targets. This new approach—network pharmacology—offers new ways to improve drug efficacy, to rescue orphan drugs, re-purpose existing drugs, predict targets, and predict side-effects.

Building on work Stuart Armstrong and I did at InhibOx, a spinout from the University of Oxford’s Chemistry Department, and inspired by the work of Shoichet et al. (2007), Álvaro Cortes-Cabrera and I took our ElectroShape method, designed for ultra-fast ligand-based virtual screening (Armstrong et al., 2010 & 2011), and built a new way of exploring the relationships between drug targets (Cortes-Cabrera et al., 2013). Ligand-based virtual screening is predicated on the molecular similarity principle: similar chemical compounds have similar properties (see, e.g., Johnson & Maggiora, 1990). ElectroShape built on the earlier pioneering USR (Ultra-fast Shape Recognition) work of Pedro Ballester and Prof. W. Graham Richards at Oxford (Ballester & Richards, 2007).

Our new approach addressed two Inherent limitations of the network pharmacology approaches available at the time:

Chemical similarity is calculated on the basis of the chemical topology of the small molecule; and
Structural information about the macromolecular target is neglected.

Our method addressed these issues by taking into account 3D information from both the ligand and the target.

The approach involved comparing the similarity of each set ligands known to bind to a protein, to the equivalent sets of ligands of all other known drug targets in DrugBank, DrugBank is a tremendous “bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.” This analysis generated a network of related proteins, connected by the similarity of the sets of ligands known to bind to them.

We looked at two different kinds of ligand similarity metrics, the inverse Manhattan distance of our ElectroShape descriptor, and compared them to 2D Morgan fingerprints, calculated using the wonderful open source cheminformatics toolkit, RDKit from Greg Landrum. Morgan fingerprints use connectivity information similar to that used for the well known ECFP family of fingerprints, which had been used in the SEA method of Keiser et al. We also looked at the problem from the receptor side, comparing the active sites of the proteins. These complementary approaches produced networks that shared a minimal fraction (0.36% to 6.80%) of nodes: while the direct comparison of target ligand-binding sites could give valuable information in order to achieve some kind of target specificity, ligand-based networks may contribute information about unexpected interactions for side-effect prediction and polypharmacological profile optimization.

Our new target-fishing approach was able to predict drug adverse effects, build polypharmacology profiles, and relate targets from two complementary viewpoints:
ligand-based, and target-based networks. We used the DUD and WOMBAT benchmark sets for on-target validation, and the results were directly comparable to those obtained using other state-of-the-art target-fishing approaches. Off-target validation was performed using a limited set of non-annotated secondary targets for already known drugs. Comparison of the predicted adverse effects with data contained in the SIDER 2 database showed good specificity and reasonable selectivity. All of these features were implemented in a user-friendly web interface that: (i) can be queried for both polypharmacology profiles and adverse effects, (ii) links to related targets in ChEMBLdb in the three networks (2D, 4D ligand and 3D receptor), and (iii) displays the 2D structure of already annotated drugs.

References

Armstrong, M. S., G. M. Morris, P. W. Finn, R. Sharma, L. Moretti, R. I. Cooper and W. G. Richards (2010). “ElectroShape: fast molecular similarity calculations incorporating shape, chirality and electrostatics.” J Comput Aided Mol Des, 24(9): 789-801. 10.1007/s10822-010-9374-0.

Armstrong, M. S., P. W. Finn, G. M. Morris and W. G. Richards (2011). “Improving the accuracy of ultrafast ligand-based screening: incorporating lipophilicity into ElectroShape as an extra dimension.” J Comput Aided Mol Des, 25(8): 785-790. 10.1007/s10822-011-9463-8.

Ballester, P. J. and W. G. Richards (2007). “Ultrafast shape recognition to search compound databases for similar molecular shapes.” J Comput Chem, 28(10): 1711-1723. 10.1002/jcc.20681.

Cortes-Cabrera, A., G. M. Morris, P. W. Finn, A. Morreale and F. Gago (2013). “Comparison of ultra-fast 2D and 3D ligand and target descriptors for side effect prediction and network analysis in polypharmacology.” Br J Pharmacol, 170(3): 557-567. 10.1111/bph.12294.

Johnson, A. M., & G. M. Maggiora (1990). “Concepts and Applications of Molecular Similarity.” New York: John Willey & Sons.

Landrum, G. (2011). “RDKit: Open-source cheminformatics.” from http://www.rdkit.org.

Keiser, M. J., B. L. Roth, B. N. Armbruster, P. Ernsberger, J. J. Irwin and B. K. Shoichet (2007). “Relating protein pharmacology by ligand chemistry.” Nat Biotechnol, 25(2): 197-206. 10.1038/nbt1284.

Wishart, D. S., C. Knox, A. C. Guo, S. Shrivastava, M. Hassanali, P. Stothard, Z. Chang and J. Woolsey (2006). “DrugBank: a comprehensive resource for in silico drug discovery and exploration.” Nucleic Acids Res, 34(Database issue): D668-672. 10.1093/nar/gkj067.

Co-translational insertion and folding of membrane proteins

The alpha-helical bundle is the most common type of fold for membrane proteins. Their diverse functions include transport, signalling, and catalysis. While structure determination is much more difficult for membrane proteins than it is for soluble proteins, it is accelerating and there are now 586 unique proteins in the database of Membrane Proteins of Known 3D Structure. However, we still have quite a poor understanding of how membrane proteins fold. There is increasing evidence that it is more complicated than the two-stage model proposed in 1990 by Popot and Engelman.

The machinery that inserts most alpha-helical membrane proteins is the Sec apparatus. In prokaryotes, it is located in the plasma membrane, while eukaryotic Sec is found in the ER. Sec itself is an alpha-helical bundle in the shape of a pore, and its structure is able both to allow peptides to pass fully across the membrane, and also to open laterally to insert transmembrane helices into the membrane. In both cases, this occurs co-translationally, with translation halted by the signal recognition particle until the ribosome is associated with the Sec complex.

Voorhees, R. M. et al. (2014) Cell, 157(7), 1632–43

If helices are inserted during the process of translation, does folding only begin after translation is finished? On what timescale are these folding processes occuring? There is evidence that a hairpin of two transmembrane helices forms on a timescale of miliseconds in vitro. Are helices already interacting during translation to form components of the native structure? It has also been suggested that helices may insert into the membrane in pairs, via the Sec apparatus.

There are still many aspects of the insertion process which are not fully understood, and even the topology of an alpha-helical membrane protein can be affected by the last part of the protein to be translated. I am starting to investigate some of these questions by using computational tools to learn more about the membrane proteins whose structures have already been solved.

Oxford Protein Informatics Group

or "OPIG" to friends