Do we need the constant regions of Antibodies and T-cell receptors in molecular simulations?

At this week’s journal club I presented my latest results on the effect of the constant regions of antibodies (ABs) and T-cell receptors (TCRs) on the dynamics of the overall system. Not including constant regions in such simulations is a commonly used simplification found throughout the literature. This is mainly due to a massive saving in computational runtime, as illustrated below.

The constant regions contain about 210 residues, but an additional speed-up comes from the much smaller solvation box. If a cubic solvation box is used, the effect is even more severe.
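As a rough back-of-envelope illustration of where the saving comes from, consider how the water count scales with the edge of a cubic box. The box sizes below are entirely hypothetical and the bulk-water density is an approximation; this is a sketch, not a measurement from our simulations.

```python
# Back-of-envelope: number of water molecules vs. cubic box edge length.
# Box edges (10 nm vs. 7 nm) and the bulk-water density (~33.4 molecules
# per nm^3) are illustrative assumptions, not values from our systems.
def water_molecules(box_edge_nm, waters_per_nm3=33.4):
    return box_edge_nm ** 3 * waters_per_nm3

full = water_molecules(10.0)    # hypothetical box around the full AB/TCR
trimmed = water_molecules(7.0)  # hypothetical box around variable domains only
print(f"{full:.0f} vs {trimmed:.0f} waters, ratio {full / trimmed:.2f}")
```

Because volume scales with the cube of the edge, even a modest reduction in box size roughly triples the water count saved, and with it the per-step cost.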

But the question is: “Is it OK to remove the constant regions of an AB or TCR and simulate without them?”.

Using replica simulations we found that simulations with and without constant regions lead, on average, to significantly different results. The details of our analysis will soon be submitted to a scientific journal. The current working title is “Why constant regions are essential in antibody and T-cell receptor Molecular Dynamics simulations”.

Visualising Biological Data, Pt. 1

Hey Blopig Readers,

I had the privilege of going down to Heidelberg last week to see some stunning posters and artwork. I really recommend that you check some of the posters out. In particular, the “Green Fluorescent Protein” poster stuck out as my favourite. Also, if you’re a real Twitter geek, check out #Vizbi for some more tweets throughout the week.

So what did the conference entail? As a very blunt summary, it was really an eclectic collection of researchers from around the globe who showcased their research with very neat visual media. While I was hoping for a conference that gave an overview of some of the principles that dictate how to visualise proteins, genes, etc., it wasn’t like that at all! Although I was initially a bit disappointed, it turned out to be better that way: one of the key themes reiterated throughout the conference is that visualisations depend on the application!

From the week, these are the top five lessons I walked away with, and I hope you can integrate them into your own visualisations:

  1. There is no pre-defined, accepted way of visualising data. Basically, every visualisation is tailored and has a specific purpose, so don’t try to fit your graph into something pretty that you’ve seen in another paper. We’re encouraged to get insight from others, but not necessarily to replicate a graph.
  2. KISS (Keep it simple, stupid!) Occam’s razor, KISS, whatever you want to call it – keep things simple. Making an overly complicated visualisation may backfire.
  3. Remember your colours. Colour is probably one of the most powerful tools in our arsenal for making the most of a visualisation. Don’t ignore them, and make sure that they’re clean, separate, and interpretable — even to those who are colour-blind!
  4. Visualisation is a means of exploration and explanation. Make lots and lots of prototypes of data visuals. It will not only help you explore the underlying patterns in your data, but also help you develop the skills to explain your data.
  5. Don’t forget the people. Basically, a visualisation is really for a specific target audience, not for a machine. What you’re doing is encouraging connections, sharing knowledge, and creating an experience so that people can learn from your data.

I’ll come back in a few weeks’ time after reviewing some tools, stay tuned!

Journal Club: Comments on Three X-ray Crystal Structure Papers

One of the fundamental weaknesses of X-ray crystallography, when used for the solution of macromolecular structures, is that the constructed models are based on the subjective interpretation of the electron density by the crystallographer.

This can lead to poor or simply incorrect models, as discussed by Stanfield et al. in their recent paper “Comment on Three X-ray Crystal Structure Papers” (link below). Here, they assert that the basis of several papers by Dr. Salunke and his coworkers, a series of antibody-peptide complexes, is fundamentally flawed. It is argued that the experimental electron density does not support the presence of the peptide models: there is no significant positive OMIT density for the peptides when they are removed from the model, and the quality of the constructed models is poor, with unreasonably large B-factors.

Link to paper: http://www.jimmunol.org/content/196/2/521.1.


Firstly, a quick recap on crystallographic maps and how they are used. Two map types are principally used in macromolecular crystallography: composite maps and difference maps.

The composite map is used to approximate the electron density of the crystal. It consists of twice the observed electron density minus the modelled density, and contains correction factors (m, D) to minimise phase bias. The weighting of 2 on the observed map compensates for the poor phases, which cause un-modelled features to appear only weakly in the density. It is universally represented as a blue mesh.


The difference map is the modelled density subtracted from the observed density, and is used to identify un-modelled areas of the electron density. It contains the same correction factors to compensate for phase bias. It is universally represented as a green mesh for positive values, and a red mesh for negative values. The green and red meshes are always contoured at the same absolute values, e.g. ±1 or ±1.4.


The problem of identifying features to model in the electron density is the point where the subjectivity of the crystallographer is most influential. For ligands, this means identifying blobs that are “significant”, and that match the shape of the molecule to be modelled.

When a crystallographer is actively searching for the presence of a binding molecule, in this case a peptide, it is easy to misinterpret density as the molecule you are searching for. You have to be disciplined and highly critical to avoid accidentally contouring to levels that are too low, and modelling into density that does not really match the model. This is what happened in the series of structures criticised by Stanfield et al.


Specific concerns with the structures of Dr Salunke et al.

1: Contouring difference maps at only positive values (and colouring them blue?)

The first questionable thing that Dr Salunke et al do is to present a difference map contoured at only positive values as evidence for the bound peptide. This oddity is compounded by colouring the resulting map blue, which is unusual for a difference map.


2: Contouring difference maps to low values

Salunke et al. claim that the image above shows adequate evidence for the binding of the peptide in a difference map contoured at 1.7σ.

When contouring difference maps at such low levels, weak features will indeed become detectable if they are present; but in solvent channels, where the crystal is an ensemble of disordered states, there is no way to interpret the density as an atomic model. Hence, a difference map contoured at 1.7σ will show blobs across all of the solvent channels in the crystal.

This fact, in itself, does not prove that the model is wrong, but it makes it highly likely that the model is a result of observation bias. This observation bias occurs because the authors were looking for evidence of the binding peptide, and so inspected the density at the binding site. This has led to the over-interpretation of noisy and meaningless density as the peptide.

The reason that the 3σ threshold is used to identify crystallographic features in difference maps is that it picks out only strong un-modelled features that are unlikely to be noise or a disordered feature.
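A toy numerical sketch of why the threshold matters: in a map that is pure Gaussian noise (which is a simplification — a real map is not pure noise, but solvent regions behave similarly), the fraction of grid points above 1.7σ is dozens of times larger than the fraction above 3σ.

```python
import numpy as np

# Fraction of voxels in a pure-noise (standard normal) "map" above each
# contour level. This is a statistical illustration, not a real
# crystallographic map.
rng = np.random.default_rng(0)
noise_map = rng.standard_normal(1_000_000)

frac_17 = float(np.mean(noise_map > 1.7))  # ~4.5% of voxels
frac_30 = float(np.mean(noise_map > 3.0))  # ~0.13% of voxels
print(frac_17, frac_30)
```

So at 1.7σ roughly one voxel in twenty lights up by chance alone, which is exactly why low-contour “blobs” appear throughout the solvent channels.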

More worryingly, the model does not actually fit the density very well.

3: Poor Model Quality

Lastly, the quality of the modelled peptides is very poor. The B-factors of the ligands are much higher than those of the surrounding protein side-chains. This is symptomatic of a modelled feature that is not present in the data: the refinement program tries to “erase” the presence of the model by inflating its B-factors. Again, this does not prove that the model is wrong, but it highlights the poor quality of the model.

Furthermore, the Ramachandran outliers in the peptides are extreme, with values in the 0th percentile of empirical values. This means that the conformation of the peptide is highly strained, and therefore highly unlikely.


Combining all of the evidence above, as presented in the article by Stanfield et al., there is little doubt that the models presented by Salunke et al. are incorrect. Individual failings in one area could be explained away, but such a range of errors across so many quality metrics cannot.

Network Comparison

Why network comparison?

Many complex systems can be represented as networks, including friendships (e.g. Facebook), the World Wide Web, trade relations and biological interactions. In a friendship network, for example, individuals are represented as nodes and an edge between two nodes represents a friendship. The study of networks has thus been a very active area of research in recent years, and, in particular, network comparison has become increasingly relevant. Network comparison itself has many wide-ranging applications: for example, comparing protein-protein interaction networks could lead to increased understanding of underlying biological processes. Network comparison can also be used to study the evolution of networks over time and to identify sudden changes and shocks.


An example of a network.

How do we compare networks?

There are numerous methods that can be used to compare networks, including alignment methods, fitting existing models, global properties such as the density of the network, and comparisons based on local structure. As a very simple example, one could base comparisons on a single summary statistic such as the number of triangles in each network. If there were a significant difference between these counts (relative to the number of nodes in each network), then we would conclude that the networks are different; for example, one may be a social network in which triangles are common – “friends of friends are friends”. However, this is a very crude approach and is often not helpful for determining whether two networks are similar. Real-world networks can be very large, are often deeply inhomogeneous and have a multitude of properties, which makes the problem of network comparison very challenging.
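The triangle-count comparison described above can be sketched in a few lines of standard-library Python. The two toy graphs below are invented for illustration: a 4-clique and a 6-cycle have the same number of edges but very different triangle counts.

```python
from itertools import combinations

def triangle_count(edges):
    """Count triangles in a small graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(1 for u, v, w in combinations(sorted(adj), 3)
               if v in adj[u] and w in adj[u] and w in adj[v])

# A 4-clique ("friends of friends are friends") vs. a 6-cycle:
# both have six edges, but very different triangle counts.
clique = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(triangle_count(clique), triangle_count(cycle))  # → 4 0
```

Of course, as the text notes, a single statistic like this throws away almost all of the topology, which motivates richer measures such as Netdis below.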

A network comparison methodology: Netdis

Here, we describe a recently introduced network comparison methodology. At the heart of this methodology is a topology-based similarity measure between networks, Netdis [1]. The Netdis statistic assigns a value between 0 and 1 (close to 1 for very good matches between networks and close to 0 for dissimilar networks) and, consequently, allows many networks to be compared simultaneously via their Netdis values.

The method

Let us now describe how the Netdis statistic is obtained and used for comparison of the networks G and H with n and m nodes respectively.

For a network G, pick a node i and obtain its two-step ego-network: that is, the network induced by the collection of all nodes in G that are connected to i via a path containing at most two edges. By induced we mean that an edge is present in the two-step ego-network of i if and only if it is also present in the original network G. We then count the number of times that various subgraphs occur in the ego-network, which we denote by N_{w,i}(G) for subgraph w. For computational reasons, this is typically restricted to subgraphs on k = 3, 4 or 5 nodes. This process is repeated for all nodes in G.
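The two-step ego-network extraction can be sketched with the standard library only. The adjacency structure below (a dict of neighbour sets for a five-node path) is a made-up toy example.

```python
# Toy sketch of the two-step ego-network used in the Netdis counts:
# the subgraph induced by all nodes within two edges of node i.
def two_step_ego(adj, i):
    ball = {i} | adj[i] | {w for v in adj[i] for w in adj[v]}
    # induced subgraph: keep only edges with both endpoints in the ball
    return {u: adj[u] & ball for u in ball}

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}  # a path 0-1-2-3-4
ego = two_step_ego(adj, 0)
print(sorted(ego))  # → [0, 1, 2]
```

Counting the subgraphs N_{w,i}(G) would then run over each such ego-network; for real networks one would use a graph library rather than raw dicts.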

  1. Under an appropriately chosen null model, an expected value for each count N_{w,i}(G) is computed, denoted by E_w^i(G). We omit the details here, but the idea is to centre the quantities N_{w,i}(G) so as to remove background noise from the individual networks.
  2. Calculate the centred statistic [eq1].
  3. To compare networks G and H, define [eq2], where A(k) is the set of all subgraphs on k nodes and [eq3] is a normalising constant that ensures that the statistic netD_2^S(k) takes values between -1 and 1. The corresponding Netdis statistic is [eq4], which takes values in the interval between 0 and 1.
  4. The pairwise Netdis values from the equation above are then used to build a similarity matrix for all query networks. This can be done for any k ≥ 3, but for computational reasons it typically needs to be limited to k ≤ 5. Note that for k = 3, 4, 5 we obtain three different distance matrices.
  5. The performance of Netdis can be assessed by comparing the nearest-neighbour assignments of networks according to Netdis with a ‘ground truth’ or ‘reference’ clustering. A network is said to have a correct nearest neighbour whenever its nearest neighbour according to Netdis is in the same cluster as the network itself. The overall performance of Netdis on a given data set can then be quantified using the nearest-neighbour score (NN), which for a given set of networks is defined to be the fraction of networks that are assigned correct nearest neighbours by Netdis.
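The nearest-neighbour (NN) score described above can be sketched as follows. The distance matrix and cluster labels here are invented purely for illustration (four networks, two reference clusters).

```python
# Nearest-neighbour score: the fraction of networks whose nearest
# neighbour under the comparison statistic lies in the same 'ground
# truth' cluster. Distances and labels below are made up.
def nn_score(dist, labels):
    n = len(labels)
    correct = 0
    for i in range(n):
        nearest = min((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        correct += labels[nearest] == labels[i]
    return correct / n

dist = [[0.0, 0.1, 0.8, 0.9],
        [0.1, 0.0, 0.7, 0.8],
        [0.8, 0.7, 0.0, 0.2],
        [0.9, 0.8, 0.2, 0.0]]
labels = ["A", "A", "B", "B"]
print(nn_score(dist, labels))  # → 1.0
```

In practice the distances would be derived from the pairwise Netdis values, one matrix per choice of k.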

The phylogenetic tree obtained by Netdis for protein interaction networks. The tree agrees with the currently accepted phylogeny between these species.

Why Netdis?

The Netdis methodology has been shown to be effective at correctly clustering networks from a variety of data sets, including both model networks and real-world networks, such as Facebook. In particular, the methodology recovered the correct phylogenetic tree for five species (human, yeast, fly, H. pylori and E. coli) from a Netdis comparison of their protein-protein interaction networks. Desirable properties of the Netdis methodology are the following:

  • The statistic is based on counts of small subgraphs (for example triangles) in local neighbourhoods of nodes. By taking into account a variety of subgraphs, we capture the topology more effectively than by considering a single summary statistic (such as the number of triangles). Also, by considering local neighbourhoods rather than global summaries, we can often deal more effectively with inhomogeneous graphs.

  • The Netdis statistic contains a centring by subtracting background expectations from a null model. This ensures that the statistic is not dominated by noise from individual networks.
  • The statistic also contains a rescaling to ensure that counts of certain commonly represented subgraphs do not dominate the statistic. This also allows for effective comparison even when the networks we are comparing have a different number of nodes.
  • The statistic is normalised to take values between 0 and 1 (close to 1 for very good matches between networks and close to 0 for dissimilar networks). Based on this number, we can simultaneously compare many networks; networks with Netdis values close to one can be clustered together. This offers the possibility of network phylogeny reconstruction.
A new variant of Netdis: subsampling

The performance of Netdis under subsampling for a data set consisting of protein interaction networks. The performance of Netdis starts to deteriorate significantly only when fewer than 10% of ego-networks are sampled.

Despite the power of Netdis as an effective network comparison method, like many other network comparison methods, it can become computationally expensive for large networks. In such situations the following variant of Netdis may be preferable (see [2]). This variant works by querying only a small subsample of the nodes in each network. An analogous Netdis statistic is then computed based on subgraph counts in the two-step ego-networks of the sampled nodes. From numerous simulation studies and experiments, it has been shown that this subsampling-based statistic is almost as effective as Netdis provided that at least 5 percent of the nodes in each network are sampled, with the new statistic only dropping off significantly when fewer than 1 percent of nodes are sampled. Remarkably, this procedure works well for inhomogeneous real-world networks, and not just for networks realised from classical homogeneous random graphs, in which case one would not be surprised that the procedure works.
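The node-subsampling step itself is simple; a sketch is below. The network size and sampling fraction are arbitrary, and in the real method the statistic would then be computed from the two-step ego-networks of the sampled nodes only.

```python
import random

# Sketch of the subsampling idea: query only a small fraction of nodes
# and build the statistic from their ego-networks. Sizes and fractions
# here are arbitrary illustrations.
def sample_nodes(nodes, fraction, seed=0):
    rng = random.Random(seed)
    k = max(1, int(len(nodes) * fraction))
    return rng.sample(list(nodes), k)

nodes = range(10_000)
sampled = sample_nodes(nodes, 0.05)  # 5% -- roughly where accuracy holds up
print(len(sampled))  # → 500
```

Since the subgraph counting dominates the cost, sampling 5% of nodes cuts the work by roughly a factor of twenty.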

Other network comparison methods

Finally, we note that Netdis is one of many network comparison methodologies present in the literature. Other popular network comparison methodologies include GCD [3], GDDA [4], GHOST [5], MI-Graal [6] and NETAL [7].

[1]  Ali W., Rito, T., Reinert, G., Sun, F. and Deane, C. M. Alignment-free protein
interaction network comparison. Bioinformatics 30 (2014), pp. i430–i437.

[2] Ali, W., Wegner, A. E., Gaunt, R. E., Deane, C. M. and Reinert, G. Comparison of
large networks with sub-sampling strategies. Submitted, 2015.

[3] Yaveroglu, O. N., Malod-Dognin, N., Davis, D., Levnajic, Z., Janjic, V., Karapandza, R., Stojmirovic, A. and Pržulj, N. Revealing the hidden language of complex networks. Scientific Reports 4, Article number 4547 (2014).

[4] Przulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23 (2007), pp. e177–e183.

[5] Patro, R. and Kingsford, C. Global network alignment using multiscale spectral
signatures. Bioinformatics 28 (2012), pp. 3105–3114.

[6] Kuchaiev, O. and Przulj, N. Integrative network alignment reveals large regions of
global network similarity in yeast and human. Bioinformatics 27 (2011), pp. 1390–
1396.

[7] Neyshabur, B., Khadem, A., Hashemifar, S. and Arab, S. S. NETAL: a new graph-based method for global alignment of protein–protein interaction networks. Bioinformatics 29 (2013), pp. 1654–1662.

Strachey Lecture – “Artificial Intelligence and the Future” by Dr. Demis Hassabis

For this week’s group meeting, some of us had the pleasure of attending a very interesting lecture by Dr. Demis Hassabis, founder of Deep Mind. Personally, I found the lecture quite thought-provoking and left the venue with a plethora of ideas sizzling in my brain. Since one of the best ways to end mental sizzling is by writing things down, I volunteered to write this week’s blog post in order to say my piece about yesterday’s Strachey Lecture.

Dr. Hassabis began by listing some very audacious goals: “To solve intelligence” and “To use it to make a better world”. At the end of his talk, someone in the audience asked him whether he thought it was possible to achieve these goals (“to fully replicate the brain”), to which he responded with a simple “there is nothing that tells us that we can’t”.

After his bold introductory statement, Dr. Hassabis pressed on. For the first part of his lecture, he engaged the audience with videos and concepts of a reinforcement learning agent trained to learn and play several ATARI games. I was particularly impressed by the notion that the same agent could achieve a professional level of gaming across 49 different games. Some of the videos are quite impressive and can be seen here or here. Suffice it to say that their algorithm is much better at playing ATARI than I’ll ever be. It was also rather impressive that all the algorithm received as input was the game’s score and the pixels on the screen.

Dr. Hassabis mentioned in his lecture that games provide the ideal training ground for any form of AI. He presented several reasons for this, but the one that stuck with me was the notion that games quite often present a very simplistic and clear score. Your goal in a game is usually very well defined. You help the frog cross the road or you defeat some aliens for points. However, what I perceive to be the greatest challenge for AI is the fact that real world problems do not come with such a clear-cut, incremental score.

For instance, let us relate back to my particular scientific question: protein structure prediction. It has been suggested that much simpler algorithms such as Simulated Annealing are able to model protein structures as long as we have a perfect scoring system [Yang and Zhou, 2015]. The issue is, currently, the only way we have to define a perfect score is to use the very structure we are trying to predict (which kinda takes the whole prediction part out of the story).

Real world problems are hard. I am sure this is no news to anyone, including the scientists at Deep Mind.

During the second part of his talk, Dr. Hassabis focused on AlphaGo. AlphaGo is Deep Mind’s effort at mastering the ancient game of Go. What appealed to me in this part of the talk is the fact that Go has such a large number of possible configurations that devising an incremental score is no simple task (sounds familiar?). Yet, somehow, Deep Mind scientists were able to train their algorithm to a point where it defeated a professional Go player.

Their next challenge? In two weeks, AlphaGo will face the professional Go player with the highest number of titles in the last decade (the best player in the world?). This reminds me of when Garry Kasparov faced Deep Blue. After the talk, my fellow OPIG colleagues also seemed to be pretty excited about the outcome of the match (man vs. machine).

Dr. Hassabis finished by saying that his career goal would be to develop AI that is capable of helping scientists tackle the big problems. From what I gather (and from my extremely biased point of view; protein structure prediction mindset), AI will only be able to achieve this goal once it is capable of coming up with its own scores for the games we present it to play with (hence developing some form of impetus). Regardless of how far we are from achieving this, at least we have a reason to cheer for AlphaGo in a couple of weeks (because hey, if you are trying to make our lives easier with clever AI, I am all up for it).

Community structure in multilayer networks

 

Multilayer networks are a generalisation of networks that may incorporate different types of interactions [1]. These could be different time points in temporal data, or measurements in different individuals or under different experimental conditions. Currently, many measures and methods from monolayer networks are being extended to be applicable to multilayer networks. These include measures of centrality [2], and methods that enable us to find mesoscale structure in networks [3,4].

Examples of such mesoscale structure detection methods are stochastic block models and community detection. Both try to find groups of nodes that behave in a structurally similar way within a network. In the simplest case, you might think of two groups that are densely connected internally but only sparsely connected to each other. For example, consider two classes in a high school: there are many friendships within each class but only a small number between the classes. Often we are interested in how such patterns evolve with time. Here, the use of multilayer community detection methods is fruitful.
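The “two densely connected classes” picture can be made concrete with a standard-library toy network: two five-node cliques joined by a single between-class friendship, where almost all edges are internal to a group. This is exactly the kind of imbalance community detection methods look for.

```python
from itertools import combinations

# Toy "two school classes" network: two 5-node cliques (dense friendships
# within each class) joined by a single between-class friendship.
class_a = [0, 1, 2, 3, 4]
class_b = [5, 6, 7, 8, 9]
edges = list(combinations(class_a, 2)) + list(combinations(class_b, 2)) + [(0, 5)]

within = sum(1 for u, v in edges if (u in class_a) == (v in class_a))
between = len(edges) - within
print(within, between)  # → 20 1
```

A multilayer method would additionally couple such group assignments across layers (e.g. time points), so that communities can persist, split or merge over time.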


From [4]: Multislice community detection of U.S. Senate roll call vote similarities across time. Colors indicate assignments to nine communities of the 1884 unique senators (sorted vertically and connected across Congresses by dashed lines) in each Congress in which they appear. The dark blue and red communities correspond closely to the modern Democratic and Republican parties, respectively. Horizontal bars indicate the historical period of each community, with accompanying text enumerating nominal party affiliations of the single-slice nodes (each representing a senator in a Congress): PA, pro-administration; AA, anti-administration; F, Federalist; DR, Democratic-Republican; W, Whig; AJ, anti-Jackson; A, Adams; J, Jackson; D, Democratic; R, Republican. Vertical gray bars indicate Congresses in which three communities appeared simultaneously.

Mucha et al. analysed the voting patterns in the U.S. Senate [4]. They find that the communities align with the political party organisation. However, the restructuring of the political landscape over time is observable in the multilayer community structure. For example, the 37th Congress, during the beginning of the American Civil War, brought a major change in the voting patterns. Modern politics is dominated by a strong partition into Democrats and Republicans, with a third minor group, identifiable as the ‘Southern Democrats’, that had distinguishable voting patterns during the 1960s.

Such multilayer community detection methods can be insightful for networks from other disciplines. For example, they have been adopted to describe the reconfiguration of the human brain during learning [5]. Hopefully they will also be able to give us insight into the structure and function of protein interaction networks.

[1] De Domenico, Manlio; Solé-Ribalta, Albert; Cozzo, Emanuele; Kivelä, Mikko; Moreno, Yamir; Porter, Mason A.; Gómez, Sergio; and Arenas, Alex [2013]. Mathematical Formulation of Multilayer Networks. Physical Review X, Vol. 3, No. 4: 041022.

[2] Taylor, Dane; Myers, Sean A.; Clauset, Aaron; Porter, Mason A.; and Mucha, Peter J. [2016]. Eigenvector-based Centrality Measures for Temporal Networks

[3] Peixoto, Tiago P. Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. Physical Review E 92, 042807.

[4] Mucha, Peter J.; Richardson, Thomas; Macon, Kevin; Porter, Mason A.; and Onnela, Jukka-Pekka [2010]. Community Structure in Time-Dependent, Multiscale, and Multiplex Networks. Science, Vol. 328, No. 5980: 876-878.

[5] Bassett, Danielle S.; Wymbs, Nicholas F.; Porter, Mason A.; Mucha, Peter J.; Carlson, Jean M.; and Grafton, Scott T. [2011]. Dynamic Reconfiguration of Human Brain Networks During Learning. Proceedings of the National Academy of Sciences of the United States of America, Vol. 108, No. 18: 7641-7646.

 

Drawing Custom Unrooted Trees from Sequence Alignments

Multiple Sequence Alignments can provide a lot of information relating to the relationships between proteins. One notable example was the map of the kinome space published in 2002 (Figure 1).

 

Figure 1. Kinase space as presented by Manning et al. 2002.

Such images organise our thinking about the possible space of such proteins/genes, going beyond long lists of multiple sequence alignments. The image in Figure 1 later got a revamp, which is now the popular ‘kinome poster’ (Figure 2).

Figure 2. Revamped dendrogram of the kinome from Fig. 1. Downloaded from http://i.imgur.com/BPLUvfc.png.

Here we have created a script to produce similar dendrograms straight from multiple sequence alignment files (although clearly not as pretty as Fig. 2!). It is not difficult to find software that will produce ‘a dendrogram’ from an MSA, but making it do the simple thing of annotating the nodes with colors, shapes etc. according to the labels of the genes/sequences is slightly more problematic. Sizes might correspond to the importance of given nodes, and colors can organise nodes by their tree branches. The script uses the Biopython module Phylo to construct a tree from an arbitrary MSA and networkx to draw it:

#Treebeard.py
import networkx, pylab
from networkx.drawing.nx_agraph import graphviz_layout
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor
from Bio import AlignIO

#What color to give to the edges?
e_color = '#ccccff'
#What colors to give to the nodes with similar labels?
color_scheme = {'RSK':'#e60000','SGK':'#ffff00','PKC':'#32cd32','DMPK':'#e600e6','NDR':'#3366ff','GRK':'#8080ff','PKA':'magenta','MAST':'green','YANK':'pink'}
#What sizes to give to the nodes with similar labels?
size_scheme = {'RSK':200,'SGK':150,'PKC':350,'DMPK':400,'NDR':280,'GRK':370,'PKA':325,'MAST':40,'YANK':200}

#Edit this to produce a custom label to color mapping
def label_colors(label):
	color_to_set = 'blue'
	for label_subname in color_scheme:
		if label_subname in label:
			color_to_set = color_scheme[label_subname]
	return color_to_set

#Edit this to produce a custom label to size mapping
def label_sizes(label):
	#Default size
	size_to_set = 20
	for label_subname in size_scheme:
		if label_subname in label:
			size_to_set = size_scheme[label_subname]
	return size_to_set

#Draw a tree whose alignment is stored in msa.phy
def draw_tree():
	
	#This loads the default kinase alignment that should be in the same directory as this script
	aln = AlignIO.read('agc.aln', 'clustal')
	#This will construct the unrooted tree.
	calculator = DistanceCalculator('identity')
	dm = calculator.get_distance(aln)
	constructor = DistanceTreeConstructor()
	tree = constructor.nj(dm)
	G = Phylo.to_networkx(tree)
	node_sizes = []
	labels = {}
	node_colors = []
	for n in G:
		label = str(n)
		if 'Inner' in label:
			#These are the inner tree nodes -- leave them blank and with very small sizes.
			node_sizes.append( 1 )
			labels[n] = ''
			node_colors.append(e_color)
		else:
			#Size of the node depends on the labels!
			node_sizes.append( label_sizes(label) )
			#Set colors depending on our color scheme and label names
			node_colors.append(label_colors(label))
			#set the label that will appear in each node			
			labels[n] = label
	#Draw the tree given the info we provided!
	pos = graphviz_layout(G)
	networkx.draw(G, pos,edge_color=e_color,node_size = node_sizes, labels=labels, with_labels=True,node_color=node_colors)
	#Showing	
	pylab.show()
	#Saving the image -- uncomment
	#pylab.savefig('example.png')

if __name__ == '__main__':
	
	draw_tree()

We are going to use the kinase alignment example to demonstrate how the script can be used. The kinase alignment we use can be found here on the kinase.com website. We load the alignment and construct the unrooted tree using the Bio.Phylo module. Note that on each line of the alignment there is a name; these names are the labels that we use to define the colors and sizes of nodes. There are two dummy functions that achieve this, label_colors() and label_sizes(): if you look at them it should be clear how to define your own custom labeling.

If you download the code and the alignment and run it by:

python Treebeard.py

You should see an image similar to Fig. 3.

Fig 3. Size- and color-customized unrooted tree straight from a multiple sequence alignment file of protein kinases. Constructed using the script Treebeard.py.


Inserting functional proteins in an antibody

At the group meeting on the 3rd of February I presented the results of the paper “A General Method for Insertion of Functional Proteins within Proteins via Combinatorial Selection of Permissive Junctions” by Peng et al. This is interesting to our group, and especially to me, because it describes a novel way of designing an antibody; although I suspect that the scope of their research is much more general, with antibodies serving as a proof of concept.

Their premise is that the structure of a protein consists essentially of secondary and tertiary structural elements interconnected through junctions. As such, it should be possible to interconnect regions from different proteins through junctions, and these regions should take up their native secondary and tertiary structures, thus preserving their functionality. The question is: what is a suitable junction? This is important because these junctions should be flexible enough to allow the proper folding of the different regions, but not so flexible as to have a negative impact on stability. There has been previous work on trying to design suitable junctions; however, the workflow presented in this paper is based on trying a vast number of junctions and then identifying which of them work.

As I said above, their proof of concept is antibodies. They used an antibody scaffold (the host), out of which they removed the H3 loop and then fused to it, via junctions, two different proteins: Leptin and FSH (the guests). To identify the correct junctions they generated a library of antibodies with random three-residue sequences on either side of the inserted protein, plus a generic linker (GGGGS) that can be repeated up to three times.

They state that the theoretical size of the library is 10^9 (however I would say it is 9*20^6), and the actually achieved diversity of their library was 2.88*10^7 for Leptin and 1.09*10^7 for FSH. The next step is to identify which junctions allowed the guest protein to fold properly. For this they devised an autocrine-based selection method using engineered cells carrying beta-lactamase receptors that have either Leptin or FSH as agonists. A fluoroprobe in the cell responds to the presence of beta-lactamase by producing a blue color instead of green, which allows the cells carrying an active antibody-guest designed protein (clone) to be identified using FRET-based fluorescence-activated cell sorting.
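My 9*20^6 estimate assumes 20 choices at each of the six fully random residue positions and three linker-repeat options (1 to 3 copies of GGGGS) independently on each side of the guest; the arithmetic is easy to check:

```python
n_aa = 20              # amino acid alphabet
n_random = 6           # three fully random residues on each side of the guest
n_linker_options = 3   # GGGGS repeated 1, 2 or 3 times, independently per side

library_size = (n_linker_options ** 2) * (n_aa ** n_random)
print(library_size)  # 576000000, i.e. 9 * 20^6, roughly 5.8 * 10^8
```

Either way, the achieved diversities of ~10^7 sample only a small fraction of the theoretical library.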

They managed to identify 6 clones that worked for Leptin and 3 that worked for FSH, with the linkers listed in a table in the paper.

There does not seem to be a pattern emerging from those linker sequences, although one of them repeats itself. For my research it would have been interesting if a pattern had emerged, as it could then have been used as a generic linker by future designers. Nevertheless, this is another prime example of how well an antibody scaffold can be used as a starting point for protein engineering.

As a bonus they also tested their designs in vivo and discovered that the antibody-Leptin design (IgG-Leptin) has a longer lifetime. This is probably because, being a larger protein, it is not filtered out by the kidneys.

Identifying basic building blocks/motifs of networks


The optimal subgraph decomposition of an electronic circuit.

There are many verbal descriptions of network motifs: characteristic connectivity patterns, over-represented subgraphs, recurrent circuits, basic building blocks of networks, just to name a few. However, as with most concepts in network science, network motifs are perhaps best explained in terms of empirical observations. The most basic example of a network motif is the motif consisting of three mutually connected nodes, that is, a triangle. Many real-world networks, ranging from the internet to social networks to biological networks, contain many more triangles than one would expect if they were wired randomly. In certain cases there exist good explanations for the large number of triangles found in a network. For instance, the presence of many triangles in friendship networks simply tells us that we are more likely to be friends with the friends of our friends. In biological networks, triangles and other motifs are believed to contribute to the overall function of the network by performing modular tasks such as information processing, and are therefore believed to be favoured by natural selection.

The predominant definition of network motifs is due to Milo et al. [1] and defines network motifs on the basis of how surprising their frequency in the network is when compared to randomized versions of the network. The randomized version is usually taken to be the configuration model, i.e. the ensemble of all networks that have the same degree distribution as the original network. Following this definition, motifs are identified by comparing their counts in the original network with a large sample from this null model. The approach of Milo et al. formalizes the concept of network motifs as over-represented connectivity patterns; however, the results of the method are highly dependent on the choice of null model.
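The Milo et al. recipe can be sketched in a few lines of plain Python (a toy illustration, not the authors' code): count a motif, here triangles, in the real network, then compare the count to degree-preserving randomizations produced by double-edge swaps, a standard stand-in for sampling the configuration model.

```python
import random
from itertools import combinations

def triangle_count(edges):
    """Count triangles in an undirected graph (edges: iterable of frozensets)."""
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(1 for u, v, w in combinations(sorted(adj), 3)
               if v in adj[u] and w in adj[u] and w in adj[v])

def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization via double-edge swaps:
    (a,b),(c,d) -> (a,d),(c,b), rejecting self-loops and multi-edges."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    for _ in range(n_swaps):
        (a, b), (c, d) = rng.sample(edge_list, 2)
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop or merge edges
        current = {frozenset(e) for e in edge_list}
        if frozenset((a, d)) in current or frozenset((c, b)) in current:
            continue  # swap would create a multi-edge
        edge_list.remove((a, b)); edge_list.remove((c, d))
        edge_list.append((a, d)); edge_list.append((c, b))
    return {frozenset(e) for e in edge_list}
```

A motif in this sense is then a subgraph whose count in the real network sits many standard deviations above its mean count over many rewired copies.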

In my talk I presented an alternative approach to motif analysis [2] that seeks to formalize network motifs from the perspective of simple building blocks. The approach is based on finding an optimal decomposition of the network into subgraphs. Here, subgraph decompositions are defined as subgraph covers: sets of subgraphs such that every edge of the network is contained in at least one subgraph in the cover. It follows from this definition that a subgraph cover is a representation of the network, in the sense that the network can be recovered fully from a cover by simply taking the union of the edge sets of its subgraphs. In fact, many network representations, including edge lists, adjacency lists, bipartite representations and power graphs, fall into the category of subgraph covers. For instance, the edge-list representation is equivalent to the cover consisting of all single-edge subgraphs of the network, and bipartite representations are simply covers consisting of cliques of various sizes.
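As a concrete illustration (toy edges, not from the paper), a cover really does determine the network via the union of its subgraphs' edge sets, and the same network admits several different covers:

```python
# Edges are stored as frozensets so (u, v) and (v, u) compare equal.
network = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3), (3, 4)]}

# The edge-list representation: the cover of all single-edge subgraphs.
edge_list_cover = [{e} for e in network]

# An alternative cover: one triangle subgraph plus one single edge.
clique_cover = [
    {frozenset((1, 2)), frozenset((2, 3)), frozenset((1, 3))},  # triangle
    {frozenset((3, 4))},                                        # single edge
]

# Both covers recover the network exactly.
assert set().union(*edge_list_cover) == network
assert set().union(*clique_cover) == network
```

The question addressed next is how to choose among such competing covers.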

Given that there are many competing ways of representing a network as a subgraph cover, the question arises of how to pick one cover over another. To address this problem we consider the total information of a subgraph cover as a measure of optimality. The total information is an information measure introduced by Gell-Mann and Lloyd [3] which, given a model M for a certain entity e, is defined as the sum of the entropy and the effective complexity of M. The entropy measures the information required to describe e given M, while the effective complexity measures the amount of information required to specify M itself, given by its algorithmic information content. The total information also provides a framework for model selection: given two or more models for the same entity, one picks the one with the lowest total information, and if two models have the same total information one picks the one with the lower effective complexity, i.e. the simpler one. This essentially tells us how to trade off goodness of fit against model complexity.
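The selection rule itself can be written down directly (a toy encoding of the rule as stated above, with invented model names and numbers, not code from the paper):

```python
def select_model(models):
    """models: iterable of (name, entropy, effective_complexity) triples.
    Total information = entropy + effective complexity. Pick the model
    with the lowest total information, breaking ties in favour of the
    lower effective complexity (the simpler model)."""
    return min(models, key=lambda m: (m[1] + m[2], m[2]))[0]

# Three hypothetical models, all with total information 5.0:
# the tie is broken in favour of the simplest one.
print(select_model([("A", 3.0, 2.0), ("B", 2.0, 3.0), ("C", 4.0, 1.0)]))  # C
```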

In the context of subgraph covers, the entropy of a cover C corresponds to the information required to give the positions of the subgraphs in the cover, given the different motifs that occur in C and their respective frequencies. The effective complexity of C, on the other hand, corresponds to the information required to describe the set of motifs occurring in the cover together with their respective frequencies. While the entropy of a subgraph cover can be calculated analytically, its effective complexity is not computable due to the halting problem; in practice, however, one can use approximations in the form of upper bounds.

Following the total information approach, we now define an optimal subgraph cover of a network G to be a subgraph cover that minimizes the total information, and the network motifs of G to be the motifs/connectivity patterns that occur in such an optimal cover.
The problem of finding an optimal cover turns out to be computationally rather challenging. Besides the usual difficulties associated with counting subgraphs (the subgraph isomorphism problem, which is NP-complete) and classifying subgraphs (the graph isomorphism problem, whose complexity is unknown), the problem is a non-linear set covering problem and therefore NP-hard. Consequently, we construct a greedy heuristic for the problem.

When applied to real-world networks, the method finds very similar motifs in networks representing similar systems. Moreover, the counts of the motifs in networks of the same type scale approximately with network size. Consequently, the method can also be used to classify networks according to their subgraph structure.

References:

[1] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Network Motifs: Simple Building Blocks of Complex Networks, Science 298, 824 (2002)

[2] A. E. Wegner, Subgraph Covers: An Information-Theoretic Approach to Motif Analysis in Networks, Phys. Rev. X 4, 041026 (2014)

[3] M. Gell-Mann and S. Lloyd, Information Measures, Effective Complexity, and Total Information, Complexity 2, 44 (1996).

Novelty in Drug Discovery

The primary aim of drug discovery is to find novel molecules that are active against a target of therapeutic relevance and that are not covered by any existing patents (1). Due to the increasing cost of research and development in the later stages of drug discovery, and the increasing number of drug candidates failing at these stages, there is a desire to select the most diverse set of active molecules at the earliest stage of drug discovery, to maximise the chance of finding a molecule that can be optimised into a successful drug (2,3). Computational methods that are both accurate and efficient are one approach to this problem and can augment experimental approaches in deciding which molecules to take forward.

But what do we mean by a “novel” compound? When prioritising molecules for synthesis, which characteristics do we want to be different? It was once common to select subsets of hits to maximise chemical diversity in order to cover as much chemical space as possible (4). These novel lead molecules could subsequently be optimised, the idea being that maximising the coverage of chemical space would maximise the chance of finding a molecule that could be optimised successfully. More recently, however, the focus has shifted to “biodiversity”: diversity in terms of how the molecule interacts with the protein (1). Activity cliffs, pairs of molecules that are structurally and chemically similar but have a large difference in potency, indicate that chemical diversity may not be the best criterion for identifying molecules that interact with the target in sufficiently diverse ways. The molecules taken forward should be both active against the target and diverse in terms of how they interact with it, and in which parts of the binding site they contact.
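Chemical diversity of the older kind is typically quantified with fingerprint similarity. A minimal Tanimoto sketch (the bit positions below are invented; real workflows would derive them from structural fingerprints) shows why activity cliffs are troubling: two molecules can score as nearly identical chemically while behaving very differently against the target.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two molecular fingerprints,
    represented here simply as sets of 'on' bit positions."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Two invented fingerprints sharing 4 of 6 total bits: high chemical
# similarity, yet (as with an activity cliff) potency could differ widely.
mol_a = {1, 4, 7, 9, 12}
mol_b = {1, 4, 7, 9, 15}
print(round(tanimoto(mol_a, mol_b), 3))  # 0.667
```

A diversity-based selection would discard one of these two molecules as redundant, even though they might interact with the binding site quite differently.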

This raises two interesting ideas. The first is prioritising molecules that form the same interactions as molecules known to bind but are chemically different: scaffold hopping (5). The second is prioritising molecules that potentially form different interactions to known binders. I hope to explore this in the coming months as part of my research.

References

(1) J. K. Medina-Franco et al., Expert Opin. Drug Discov., 2014, 9, 151-156.

(2) A. S. A. Roy, Project FDA Report, 2012, 5.

(3) J. Avorn, New England Journ. of Med., 2015, 372, 1877-1879.

(4) P. Willett, Journ. Comp. Bio., 1999, 6, 447-457.

(5) H. Zhao, Drug Discov. Today,  2007, 12, 149–155.