# Convergent affinity maturation.

Antibodies are the first line of defense of our organisms against noxious substances. They are the proteins which we ‘train’ to recognize noxious substances when we get immunized. Therefore understanding the immune response after being presented with an antigen is instrumental in developing novel vaccines.

One hypothesis relating to immune response to an antigen is that different organisms are likely to raise similar or even identical antibodies against the same antigen. Testing this hypothesis has become more realistic recently with the advent of Next Generation Sequencing technologies (NGS). Using NGS techniques it is possible to interrogate the sequential makeup of a large set of B-cells.

Such a study was conducted not that long time ago by Trueck et al. They have analysed antibody repertoires of five individuals, pre and post immunization to check if the immune systems converged on similar antibody sequences. The five individuals were immunized with a conjugate vaccine of HiB, MenC and TT. The antibodies were sequenced from cells extracted pre-vaccination and seven days after vaccination.

Firstly, the antibody repertoire appeared to reflect the fact that an organism was mounting an immune response as the clonality post vaccination was higher than before vaccination (more cells producing similar antibodies). Secondly, authors focused on identifying sequences from the public repertoire — those antibodies that are shared between individuals. This analysis focused on the CDR3 only, of which 47 were shared between at least two of the five individuals. Quite a large proportion of those sequences were known to be specific towards HiB and the enrichment of these was muhch higher in the post-vaccination sample. Only one sequence in this set was known previously to target TT. Nevertheless, relaxing the sequence similarity condition, a lot of sequences related to those known to be TT-specific were found among the five individuals. Most importantly, the number of such sequences was much higher in the post-vaccination samples, indicating that these might indeed have been raised in response to TT stimulation. The same was not true for MenC as hardly any sequences related to this antigen were found in the immune response of the five individuals.

Therefore, authors claim that looking at the enrichment of such shared sequences can be an indicator of the effectiveness of the immune response. They correlate statistics coming from looking at the number of shared sequences which appear to have moderate correlation to the antibody avidity data (even though p-values in some cases are quite high). This indicates that even in such a small set of individuals, antibodies are capable of converging on similar solutions. This might provide clues as to the characteristics antibodies that recognize specific antigens and thus facilitate novel vaccine design.

# Is “fragment-based” still the way forward in template-free protein structure prediction?

Out of the many questions surrounding the notion that you can predict a protein’s structure from its sequence, there is one in particular that I decided to tackle during last group meeting.

Protein structure prediction is a hard problem (do I sound repetitive?). One of the many cop outs employed by the structure prediction community is the idea that you can break down known structures into fragments and use these protein pieces to perform predictions. This is known as fragment-assembly or fragment-based template-free protein structure prediction.

As absurd as the idea may seem, there is robust evidence that suggests that this is actually a viable strategy. There is a notion that the fragment space is complete; you can reconstruct the backbone of any known structure based on the torsion angles of fragments from other structures. In less technical jargon, you can effectively use fragments and combine them to re-create any of the protein structures that we know and to a fairly acceptable level of precision.

So, technically, it is possible to predict a protein structure using fragments from other structures. In practice, you are still left with the problem of choosing the right fragments to model your sequence of interest. How easy do you think that is?

We can look at this question in light of observations that were made back in the early 80s. Kabsch and Sander reported that two protein fragments having exactly the same sequence can present completely different structures [1]. This complies with the notion that global properties can affect and even define local structure, which in turn suggests that selecting the right fragments to assemble a structure is not necessarily a straightforward process.

The starting point for protein structure prediction is a sequence. Since we are talking about template-free protein structure prediction, it is safe to assume that there is no good global sequence match to your target with a known structure (otherwise you would use that match/structure as a template). Hence, fragment selection is restricted to local sequence similarity, which, as suggested in the previous paragraph, is not necessarily ideal.

On the other hand, we are becoming increasingly more accurate in inferring one-dimensional properties from a protein’s sequence. These properties can and often are used to enhance our fragment-selection capabilities. Yet, even using the state-of-the-art in secondary structure and torsion angle prediction, fragment selection is still fairly imprecise.

During group meeting I highlighted a possible contrast between practical fragment space and general (or possible) fragment space. My premise is simple.  I define practical fragment space as the fragments that we can accurately select from the possible fragment space to model protein structures. In my opinion, it would be extremely interesting to quantify the difference between the two. This would answer the fundamental question of how useful fragment-assembly actually is. More importantly, it would help the community make an educated decision in regards to whether template-free structure prediction strategies should shift from fragment-based to ones based on distance constraints, an approach that is gaining popularity due to the success of contact predictions.

I am very keen to investigate this further. Maybe for my next blog post, we will have an answer! Stay tuned.

# Journal Club: “Discriminative Chemical Patterns: Automatic and Interactive Design”

For Journal Club this week I decided to discuss the following paper by M. Rarey et al., which describes a method of using SMARTS patterns to discriminate between two sets of molecules. Link to paper here.

Given two sets of molecules can one generate a pattern that discriminates between two sets? This relates to a key question in drug design: can we predict whether molecules bind or not given a set of binders and a set of non-binders. The method is of particular interest because it makes use of data available, unlike conventional methods. However, for this technique to work, the correct molecular classification is required to discriminate between the two sets of molecules.

Originally molecules were classified using physiochemical properties for example, molecular weight or log P. However these classifications are too general and do not encompass enough molecular detail for accurate discrimination. An alternative is to using topological fingerprints. These encode a set the presence of a set of topological features using a series of bits. One of the limitations with this classification is that it is restricted by the predefined set of structures and features. This method makes use of chemical patterns which advantageously can can classify a chemical feature that cannot be sufficiently described by molecular substructure.

SMARTS (a molecular description language based on SMILES) allow description of structures with varying levels of specificity. For example one can specify atomic element, whether the atom is a subset of elements, whether it is aliphatic or aromatic, or whether it is in a ring. The method makes use of this description of molecules as the group have already developed some software to visualise SMARTS strings and modify them: the SMARTSeditor.

The method involves combining automatic pattern generation and visualisation to form SMARTSminer. Given two distinct molecule sets, the algorithm derives connected chemical patterns to differentiate both sets by using a sub-graph mining technique: solutions are extended by single elements iteratively.

The SMARTSminer was then used to test a series of test cases using the DUD (Database of Useful Decoys) data set. This seems strange when the data set has been shown to be inaccurate and perhaps there are more accurate test sets available, such as DUDe (Database of Useful Decoys enchanced). Let us look a couple of these case sets in more detail.

1. Discrimination between Active Molecules on Similar Targets

The first case set looks at discriminating between molecules that are active for COX-1 and COX-2. COX proteins are cyclooxygenase that are involved in inflammatory reaction. These proteins are targeted by inhibitors such as aspirin and ibuprofen for the relief of inflammatin and pain. Both COX-1 and COX-2 are similar targets with similar molecular weight and 65% sequence identity. Selective inhibition is only due to a difference in residue at position 523.

Separation of the sets of molecules was possible with a pattern identified that hit 21/25 of the molecules active for COX-1 and 15/348 of molecules of molecules active for COX-2. When the positive and negative set are reversed a pattern is identified that matched 313/348 of COX-2 actives but only 1 of the COX-1 ligands. The group state that perfect separation is not possible as there is an overlap of 2 molecules.

It is interesting that patterns were identified that could discriminate between the two sets. However, there is no discussion of how to use this information. Additionally the pattern determined has not been tested on any molecules outside of the training set – there are no blind tests. This seems strange as a blind test could emphasise the usefulness of this method if it was successful.

2. Discrimination between Active and Inactive Molecules

The second case investigates determining whether a pattern can be generated that discriminates between active and inactive targets. The test case used target SAHH (S-adenosyl-homocysteine hydrolase). A pattern was generated that matched all active molecules and only 1% of inactives. What is particularly exciting is that the pattern found contains part of the interaction network hydrogen bonding partners of the ligand, as shown in the figure below (the pattern identified is highlighted in green).

I find it very surprising that the group did not follow up with blind tests of molecules not used in the training set – especially as the pattern identified a key part of the binding mechanism.

To summarise a new method, SMARTSminer, calculates discriminative patterns between two sets of molecules using the SMARTS language. The authors state that the method has shown applicability in several use cases covering the application of actives vs decoys, kinase classifications, analysis of data sets and characterisation of reaction centers. However, I’m not sure I can agree with that statement. I believe further blind tests would be required to prove the applicability of the method once the pattern has been found. I also believe that an analysis of whether the pattern is over fitted to the training data is also required.

# Do we need the constant regions of Antibodies and T-cell receptors in molecular simulations?

At this week’s journal club I presented my latest results on the effect of the constant regions of antibodies (ABs) and T-cell receptors (TCRs) on the dynamics of the overall system. Not including constant regions in such simulations is a commonly used simplification that is found throughout the literature. This is mainly due a massive saving in computational runtime as illustrated below:

The constant regions contain about 210 residues but an additional speed up comes from the much smaller solvation box. If a cubic solvation box is used then the effect is even more severe:

But the question is: “Is is OK to remove the constant regions of an AB or TCR and simulate without them?”.

Using replica simulations we found that simulations with and without constant regions lead to (on average) significantly different results. The detail of our analysis will soon be submitted to a scientific journal. The current working title is “Why constant regions are essential in antibody and T-cell receptor Molecular Dynamics simulations”.

# Visualising Biological Data, Pt. 1

I had the privilege to go down to Heidelberg last week to go and see some stunning posters and artwork. I really recommend that you check some of the posters out. In particular, the “Green Fluorescent Protein” poster stuck out as my favourite. Also, if you’re a real Twitter geek, check out #Vizbi for some more tweets throughout the week.

So what did the conference entail? As a very blunt summary, it was really an eclectic collection of researchers around the globe who showcased their research with very neat visual media. While I was hoping for a conference that gave an overview of some of the principles that dictate how to visualise proteins, genes, etc., it wasn’t like that at all! Although I was initially a bit disappointed, it turned out to be better – one of the key themes that were re-iterated throughout the conference is that visualisations are dependent on the application!

From the week, these are the top 5 lessons I walked away with, and I hope you can integrate this into your own visualisation:

1. There is no pre-defined, accepted way of visualising data. Basically, every visualisation is tailored, has a specific purpose, so don’t try to fit your graph into something pretty that you’ve seen in another paper. We’re encouraged to get insight from others, but not necessarily replicate a graph.
2. KISS (Keep it simple, stupid!) Occam’s razor, KISS, whatever you want to call it – keep things simple. Making an overly complicated visualisation may backfire.
3. Remember your colours. Colour is probably one of the most powerful tools in our arsenal for making the most of a visualisation. Don’t ignore them, and make sure that they’re clean, separate, and interpretable — even to those who are colour-blind!
4. Visualisation is a means of exploration and explanation. Make lots, and lots of prototypes of data visuals. It will not only help you explore the underlying patterns in your data, but help you to develop the skills in explaining your data.
5. Don’t forget the people. Basically, a visualisation is really for a specific target audience, not for a machine. What you’re doing is to encourage connections, share knowledge, and create an experience so that people can learn your data.

I’ll come back in a few weeks’ time after reviewing some tools, stay tuned!

# Journal Club: Comments on Three X-ray Crystal Structure Papers

One of the fundamental weaknesses of X-ray crystallography, when used for the solution of macromolecular structures, is that the constructed models are based on the subjective interpretation of the electron density by the crystallographer.

This can lead to poor or simply incorrect models, as discussed by Stanfield et al. in their recent paper “Comment on Three X-ray Crystal Structure Papers” (link below). Here, they assert that the basis of several papers by Dr. Salunke and his coworkers, a series of antibody-peptide complexes, are fundamentally flawed. It is argued that the experimental electron density does not support the presence of the peptide models: there is no significant positive OMIT density for the peptides when they are removed from the model, and the quality of the constructed models is poor, with unreasonably large B-factors.

Firstly, a quick recap on crystallographic maps and how they are used. Two map types are principally used in macromolecular crystallography: composite maps and difference maps.

The composite map is used to approximate the electron density of the crystal. It consists of the modelled density subtracted from the observed electron density (multiplied by 2) and contains correction factors to minimise phase bias (m, D). The weighting of 2 of the observed map is used to compensate for the poor phases, which cause un-modelled features to appear only weakly in the density. It is universally represented as a blue mesh:

The difference map is the modelled density subtracted from the observed density, and is used to identify un-modelled areas of the electron density. It contains the same correction factors to compensate for phase bias. It is universally represented as a green mesh for positive values, and a red mesh for negative values. The green and red meshes are always contoured at the same absolute values, e.g. ±1 or ±1.4.

The problem of identifying features to model in the electron density is the point where the subjectivity of the crystallographer is most influential. For ligands, this means identifying blobs that are “significant”, and that match the shape of the molecule to be modelled.

When a crystallographer is actively searching for the presence of a binding molecule, in this case a peptide, it is easy to misinterpret density as the molecule you are searching for. You have to be disciplined and highly critical before accidentally contouring to levels that are too low, and modelling into density that does not really match the model. This is the case in the series of structures that are criticised by Stanfield et al.

## Specific concerns with the structures of Dr Salenke et al.

### 1: Contouring difference maps at only positive values (and colouring them blue?)

The first questionable thing that Dr Salunke et al do is to present a difference map contoured at only positive values as evidence for the bound peptide. This oddity is compounded by colouring the resulting map as blue, which is unusual for a difference map.

### 2: Contouring difference maps to low values

Salunke et al claim that the image above shows adequate evidence for the binding of the peptide in a difference map contoured at 1.7𝛔.

When contouring difference maps to such low levels, weak features will indeed be detectable, if they are present, but in solvent channels, where the crystal is an ensemble of disordered states, there is no way to interpret the density as an atomic model. Hence, a difference map at 1.7𝛔 will show blobs across all of the solvent channels in the crystal.

This fact, in itself, does not prove that the model is wrong, but makes it highly likely that the model is a result of observation bias. This observation bias occurs because the authors were looking for evidence of the binding peptide, and so inspected the density at the binding site. This has lead to the over-interpretation of noisy and meaningless density as the peptide.

The reason that the 3𝛔 limit is used to identify crystallographic features in difference maps is that this identifies only strong un-modelled features that are unlikely to be noise, or a disordered feature.

More worryingly, the model does not actually fit the density very well.

### 3: Poor Model Quality

Lastly, the quality of the modelled peptides is very poor. The B-factors of the ligands are much higher than the B-factors of the surrounding protein side-chains. This is symptomatic of the modelled feature not being present in the data, and the refinement program tries to “erase” the presence of the model by inflating the B-factors. This, again, does not prove that the model is wrong, but highlights the poor quality of the model.

Lastly, the ramachandran outliers in the peptides are extreme, with values in the 0th percentile of empirical values. This means that the conformer of the peptide is highly strained, and therefore highly unlikely.

Combining all of the evidence above, as presented in the article written by Stanfield et al, there is little doubt that the models presented by Salunke et al are incorrect. Individual failings in the models in one area could be explained, but such a range of errors across such a range of quality metrics cannot.

# Network Comparison

### Why network comparison?

Many complex systems can be represented as networks, including friendships (e.g. Facebook), the World Wide Web trade relations and biological interactions. For a friendship network, for example, individuals are represented as nodes and an edge between two nodes represents a friendship. The study of networks has thus been a very active area of research in recent years, and, in particular, network comparison has become increasingly relevant. Network comparison, itself, has many wide-ranging applications, for example, comparing protein-protein interaction networks could lead to increased understanding of underlying biological processes. Network comparison can also be used to study the evolution of networks over time and for identifying sudden changes and shocks.

An example of a network.

##### How do we compare networks?

There are numerous methods that can be used to compare networks, including alignment methods, fitting existing models,
global properties such as density of the network, and comparisons based on local structure. As a very simple example, one could base comparisons on a single summary statistic such as the number of triangles in each network. If there was a significant difference between these counts (relative to the number of nodes in each network) then we would conclude that the networks are different; for example, one may be a social network in which triangles are common – “friends of friends are friends”. However, this a very crude approach and is often not helpful to the problem of determining whether the two networks are similar. Real-world networks can be very large, are often deeply inhomogeneous and have multitude of properties, which makes the problem of network comparison very challenging.

### A network comparison methodology: Netdis

Here, we describe a recently introduced network comparison methodology. At the heart of this methodology is a topology-based similarity measure between networks, Netdis [1]. The Netdis statistic assigns a value between 0 and 1 (close to 1 for very good matches between networks and close to 0 for similar networks) and, consequently, allows many networks to be compared simultaneously via their Netdis values.

##### The method

Let us now describe how the Netdis statistic is obtained and used for comparison of the networks G and H with n and m nodes respectively.

For a network G, pick a node i and obtain its two-step ego-network. That is, the network induced by collection of all nodes in G that are connected to i via a path containing at most two edges. By induced we mean that a edge is present in the two-step ego-network of i if and only if it is also present in the original network G. We then count the number of times that various subgraphs occur in the ego-network, which we denote by $latex N_{w,i}(G)$ for subgraph w. For computational reasons, this is typically restricted to subgraphs on 5 or fewer nodes. This processes is repeated for all nodes in G, for fixed k=3,4,5.

1. Under an appropriately chosen null model, an expected value for the quantities $latex N_{w,i}(G)$ is given, denoted by $latex E_w^i(G)$. We omit some of these details here, but the idea is to centre the quantities $latex N_{w,i}(G)$ to remove background noise from an individual networks.
2. Under an appropriately chosen null model, an expected value for the quantities $latex N_{w,i}(G)$ is given, denoted by $latex E_w^i(G)$.  We omit some of these details here, but the idea is to centre the quantities $latex N_{w,i}(G)$ to remove background noise from an individual networks.
3. Calculate:
4. To compare networks G and H, define:where A(k) is the set of all subgraphs on k nodes andis a normalising constant that ensures that the statistic $latex netD_2^S(k)$  takes values between -1 and 1.  The corresponding Netdis statistic is:  which now takes values in the interval between 0 and 1.
5. The pairwise Netdis values from the equation above are then used to build a similarity matrix for all query networks. This can be done for any $latex k \geq3$, but for computational reasons, this typically needs to be limited to $latex k\leq5$. Note that for     k=3,4,5 we obtain three different distance matrices.
6. The performance of Netdis can be assessed by comparing the nearest neighbour    assignments of networks according to Netdis with a ‘ground truth’ or ‘reference’ clustering. A  network is said to have a correct nearest neighbour whenever its nearest neighbour according to Netdis is in the same cluster as the network itself.  The overall performance of   Netdis on a given data set can then be quantified using the nearest neighbour score (NN),   which for a given set of networks is defined to be the fraction of networks that are assigned correct nearest neighbours by Netdis.

The phylogenetic tree obtained by Netdis for protein interaction networks. The tree agrees with the currently accepted phylogeny between these species.

##### Why Netdis?

The Netdis methodology has been shown to be effective at correctly clustering networks from a variety of data sets, including both model networks and real world networks, such Facebook. In particular, the methodology allowed for the correct phylogenetic tree for five species (human, yeast, fly, hpylori and ecoli) to be obtained from a Netdis comparison of their protein-protein interaction networks. Desirable properties of the Netdis methodology are the following:

\item The statistic is based on counts of small subgraphs (for example triangles) in local neighbourhoods of nodes. By taking into account a variety of subgraphs, we capture the topology more effectively than by just considering a single summary statistic (such as number of triangles). Also, by considering local neighbourhoods, rather than global summaries, we can often deal more effectively with inhomogeneous graphs.

• The Netdis statistic contains a centring by subtracting background expectations from a null model. This ensures that the statistic is not dominating by noise from individual networks.
• The statistic also contains a rescaling to ensure that counts of certain commonly represented subgraphs do not dominate the statistic. This also allows for effective comparison even when the networks we are comparing have a different number of nodes.
• The statistic is normalised to take values between 0 and 1 (close to 1 for very good matches between networks and close to 0 for similar networks). The statistic gives values between 0 and 1 and based on this number, we can simultaneously compare many networks; networks with Netdis value close to one can be clustered together. This offers the possibility of network phylogeny reconstruction.
##### A new variant of Netdis: subsampling

The performance of Netdis under subsampling for a data set consisting of protein interaction networks. The performance of Netdis starts to deteriorate significantly only after less than 10% of ego networks are sampled.

Despite the power of Netdis as an effective network comparison method, like many other network comparison methods, it can become computationally expensive for large networks. In such situations the following variant of Netdis may be preferable (see [2]). This variant works by only querying a small subsample of the nodes in each network. An analogous Netdis statistic is then computed based on subgraph counts in the two-step ego networks of the sampled nodes. From numerous simulation studies and experimentations, it has been shown that this statistic based on subsampling is almost as effective as Netdis provided that at least 5 percent of the nodes in each network are sampled, with the new statistic only really dropping off significantly when fewer than 1 percent of nodes are sampled. Remarkably, this procedure works well for inhomogeneous real-world networks, and not just for networks realised from classical homogeneous random graphs, in which case one would not be surprised that the procedure works.

##### Other network comparison methods

Finally, we note that Netdis is one of many network comparison methodologies present in the literature Other popular network comparison methodologies include GCD [3], GDDA [4], GHOST [5], MI-Graal [6] and NETAL [7].

[1]  Ali W., Rito, T., Reinert, G., Sun, F. and Deane, C. M. Alignment-free protein
interaction network comparison. Bioinformatics 30 (2014), pp. i430–i437.

[2] Ali, W., Wegner, A. E., Gaunt, R. E., Deane, C. M. and Reinert, G. Comparison of
large networks with sub-sampling strategies. Submitted, 2015.

[3] Yaveroglu, O. N., Malod-Dognin, N., Davis, D., Levnajic, Z., Janjic, V., Karapandza,
R., Stojmirovic, A. and Prˇzulj, N. Revealing the hidden language of complex networks. Scientific Reports 4 Article number: 4547, (2014)

[4] Przulj, N. Biological network comparison using graphlet degree distribution. Bioinformatics 23 (2007), pp. e177–e183.

[5] Patro, R. and Kingsford, C. Global network alignment using multiscale spectral
signatures. Bioinformatics 28 (2012), pp. 3105–3114.

[6] Kuchaiev, O. and Przulj, N. Integrative network alignment reveals large regions of
global network similarity in yeast and human. Bioinformatics 27 (2011), pp. 1390–
1396.

[7] Neyshabur, B., Khadem, A., Hashemifar, S. and Arab, S. S. NETAL: a new graph-
based method for global alignment of protein–protein interaction networks. Bioinformatics 27 (2013), pp. 1654–1662.