# Wilcoxon-Mann-Whitney test and a small sample size

The Wilcoxon Mann Whitney test (two samples), is a non-parametric test used to compare if the distributions of two populations are shifted , i.e. say \$latex f_1(x) =f_2(x+k)\$ where k is the shift between the two distributions, thus if k=0 then the two populations are actually the same one. This test is based in the rank of the observations of the two samples, which means that it won’t take into account how big the differences between the values of the two samples are, e.g. if performing a WMW test comparing S1=(1,2) and S2=(100,300) it wouldn’t differ of comparing S1=(1,2) and S2=(4,5). Therefore when having a small sample size this is a great loss of information.

Now, what happens when you perform a WMW test on samples of size 2 and 2 and they are as different as they can be (to what the test concerns), lest say S1=(1,2) and S2=(4,5). Then the p-value of this test would be 0.333, which means that the smallest p-value you can obtain from a WMW test when comparing two samples of size 2 and 2 is 0.3333. Hence you would only be able to detect differences between the two samples when using a level of significance greater than 0.333 .

Finally you must understand that having a sample of two is usually not enough for a statistical test. The following table shows the smallest  p-value for different small sample sizes when the alternative hypothesis is two sided. (Values in the table are rounded).

# CABS-flex

One of the still open questions in bioinformatics is that of the flexibility of proteins, and it is one in which I am quite interested. Our main source of structural information is X-ray diffraction experiments, in which the crystalline protein is intrinsically rigid. While there is an ever increasing body of NMR data, and Molecular Dynamics simulations are becoming faster and cheaper, complete information about the dynamics of every PDB structure is a long way off. All atom MD simulations for a complete protein still take either days, or a powerful computer. So, naturally, any papers that talk about ways to get flexibility information catch my attention.

The paper that caught my attention was CABS-flex: Server for fast simulation of protein structure fluctuations.. There also is a connected paper – Consistent View of Protein Fluctuations from All-Atom Molecular Dynamics and Coarse-Grained Dynamics with Knowledge-Based Force-Field – which benchmarks their method against MD simulations.

The CABS model pops up in a number of other places – there is a CABS-fold server, and the method was first described in the paper Consistent View of Protein Fluctuations from All-Atom Molecular Dynamics and Coarse-Grained Dynamics with Knowledge-Based Force-Field from 2004.

## What does the webserver do?

You give it a coordinate file, wait some hours, and it gives you back a series of (clustered) structures, a C&alpha trajectory, a residue flexibility profile, and a nice little video. Which is nice. It is, however, a little limited; you can only do a single chain (so no modelling of small molecule or peptide interactions), there is no way of fixing part of the model so it does not flex, and it can be picky about your PDB files – no chain gaps, no strange amino acids. You can tell it to analyse a PDB chain by just entering the code, but these are frequently rejected for having unacceptable coordinate files.

## How does it work?

The CABS model makes a reduced representation of the protein, and explores its conformational space by making moves – which are accepted or rejected based on a (comparatively complex) force field. Many moves are attempted, and over a large number of iterations, this builds up an ensemble of structures, which show the possible confirmations of the protein. So, its a bit like MD, but rather than moving atoms based on calculated forces, you make random moves, and accept the sensible ones.

The model is shown above. The protein structure is reduced to 4 atoms per residue. There are 5 types of move, and in each step, a random move of each type is tried ~n times (n being the number of amino acids in the protein). These are accepted or rejected based on their force field, and the process is repeated several hundred times. The force field used to accept/reject moves is quite complex – there are short range (sequence dependent and sequence independent), hydrogen bond, and long range (again, sequence dependent and sequence independent) interactions. There are sub-categories within these, and the relative contributions of the interactions can be changed. There is no solvent, so the long range interactions have to provide the forces to keep the protein together, and the hydrogen bond interactions act to maintain the secondary structure elements. More detail can be found in the 2004 paper.

## Is it true?

The authors justify the method by comparing it to MD simulations taken from a database, and conclude that CABS-flex gives answers as similar to MD simulations as MD simulations using different force fields are to each other. Simply put, the same residues move in their simulations as move in MD simulations. They move more, which backs up their claim that they demonstrate flexibility over a longer time-frame than MD simulations. They do admit that they get the best agreement when they heavily restrict their secondary structure – raising the question of how much of the agreement between all of the methods is down to all the simulations having the same secondary structure.

To conclude, this could be a useful method, particularly in that it can give long time-frame flexibility – providing you treat it as any other measure of flexibility, with a pinch of salt. It is a bit of a shame that the interface is (perhaps frustratingly) simple, and you need to pre-process your coordinate files, as many PDB coordinate files will not be accepted.

# A Colourblind Guide to Colourful Presentations…

Like many people, I am colourblind.

Fortunately I am only ‘mildly’ red-green colourblind and it doesn’t have a huge detrimental effect on my life.

Firstly, to dispel a couple of misconceptions:

1. I can still see colour. ‘blindness’ here would be better called ‘deficiency’ or ‘desensitivity’. I am simply less sensitive to reds/greens than the ‘normal’ eye. Whilst I can discriminate between blues/yellows, it is harder to distinguish some reds from some greens.
2. Colour blindness is not the swapping of colours. I don’t accidentally call red and green the wrong things – I just can’t tell what a colour is in some cases.
3. I have no problem with traffic lights.
4. Colour blindness does not mean poor eyesight. My cornea, lens, etc work fine, thank-you-very-much.

Approximately 8% of men and 0.5% of women are colourblind to various extents. There is a wide range of types, and severities, of colourblindness. For more information, there are a number of websites with helpful descriptions – This, for example…

There’s even a nature paper about colour blindness awareness…

The standard tests for colour-blindness are the well-recognised Ishihara colour tests. Lets do a few (just for fun)…

An example Ishihara Colour Test. Most people can see the ’12’ in the image. Image: Wikipedia

Another Colour Test. You might be able to see a ‘6’ in the image (I can’t…). Image: Wikipedia

Another Colour Test. You should see nothing in this image. I see a ‘2’. Image: Wikipedia

The last one. You should see a ’42’. I can just about see a nondescript blur that might be a ‘4’ on the left hand side. Image: Wikipedia

To give an idea of what it’s like, this page gives a very good example. For a theatre booking system, they indicate the seats that offer a restricted view of the stage –

Restricted view seats are clearly indicated – or are they? Image: www.digitalartsonline.co.uk

Whilst most people will be able to tell where the best seats are, for those with colour blindness it might not be so easy. The image below shows the same image from the point of view of someone with colour blindness – can you still be sure of which seat is which?

Still clear? Image: www.digitalartsonline.co.uk

Mostly, being colourblind doesn’t affect my life (when I’m not at the theatre). However, there is one area of my life where being colourblind is *really* annoying: presentations (and picking ties to match shirts, but I’ve got that figured out now).

So here’s the Nick-approved guide to making colour-blind friendly presentations.

1. Choose a colour scheme that is colour-blind friendly – these are readily available online. This is mainly for graphs. Just generally avoid pale green-pale red mixtures. Purples and pinks can also be pretty confusing.
2. Along with the above, high contrast colour schemes can be very hard to see. For instance, a presentation with a white background can make it difficult to see coloured things on the slide, as everything is drowned out by the white background – especially yellow/green text. It is also very tiring to the eye. Try dark-coloured fonts on a light-coloured background.
3. In graphs, don’t just use colours to match lines to the legend – matching colours from lines to the colours on the legend is hard – use shapes as well, or label the lines. An example.
4. If 3. is impossible, make the lines on graphs a decent thickness – small areas of colour are harder to determine.
5. When referring to slide, try not to refer to ‘the red box’. Refer instead to ‘the rounded red box in the top-right of the screen’.
6. Please don’t use red laser pointers – these are evil [citation needed]. The red light is not easily distinguishable on bright screens (or if it’s zipping around the screen). Use a green laser pointer instead. Not only are green laser pointers generally more powerful, and therefore brighter, but they are also easier to see. Why?

For a fairly comprehensive guide of how to make colour-friendly presentations, look at this page. And for checking how things might look, there are many colour-blind simulators for both images and webpages.

I hope this helps to create colour-friendly presentations.

# From Protein Interface Recognition to Protein Complex Prediction

Similarly to ‘words’, which need to be “assembled into sentences, paragraphs, chapters and books” to tell a story, ‘protein structures’ need to be assembled into protein complexes to perform a specific task. To form complexes, proteins interact with other proteins, DNA, RNA and small molecules using their interface residues. All those types of interactions are under intense scrutiny by the research community, each of them defining a distinct field of research. During my PhD I focused on protein-protein interactions (PPIs) and prediction of their interfaces. Modifications in PPIs affect the events that take place within cells which may lead to critical diseases such as cancer. Therefore, knowledge about PPIs and their resulting 3D complexes can provide key information for drug design.

Docking is a popular computational method which predicts the possible structure of the complex produced by two proteins using the known 3D structure of the individual proteins. However, docking of two proteins can result in a large number of different conformational models whose majority is far from correct. This highlights one of the main limitations of docking.  Therefore, scoring functions have been proposed which are used to re-score and re-rank docked conformations in order to detect near-native models. One way to distinguish native-like models from false docked poses is to use knowledge of protein interfaces. If one knows the possible location of interface residues on each individual protein, docked complexes which do not involve those interfaces can be rejected. Therefore, accurate prediction of protein interfaces can assist with detection of native-like conformations.

Various methods have been proposed for predicting protein interfaces as mentioned above. A high number of methods investigate protein sequential or structural features in order to characterise protein interfaces. Usage of 3D structural properties has improved the sequence-based predictions.  Moreover, evolutionary conservation was shown to be an important property. Therefore, methods have integrated various structural features along with evolutionary information to increase performance.

The combination of different features using various techniques has been investigated by intrinsic-based predictors. However, it seems that these methods have reached their saturation, and combination of more properties does not improve their prediction performance. On the other hand, many studies have investigated the 3D structure of binding sites among protein families. They discovered that the binding site localisation and structure are conserved among homologous. These properties have improved the detection of functional residues and protein-ligand binding sites. Therefore, predictors took advantage of structurally conserved residues among homologous proteins to improve binding site predictions.

Although homologous template-based predictors improve the predictions, they are limited to those proteins whose homologous structure exists. Therefore, methods have extended their search for templates to structural neighbours, since interface conservation exists even among remote structural neighbours. In addition, with the increase in experimentally determined 3D complexes good quality templates can be found for many proteins. Therefore, usage of structural neighbours is the current focus of template-based protein interfaces predictors.

Although, template-based methods are currently the predictors under the main focus, one of their main limitations is their dependency to availability of the QP 3D structure. Also, these predictors have not investigated the contribution of interacting partners of structural neighbours in the prediction. In addition, since these methods perform structural comparisons their computational time is high which limits their application to high-throughput predictions.

One of my PhD contributions was toward developing, T-PIP (Template based Protein Interface Prediction), a novel PIP approach based on homologous structural neighbours’ information. T-PIP addresses the above mentioned limitations by quantifying, first, homology between QP and its structural neighbours and, second, the diversity between the ligands of the structural neighbours (here, ligands refers to the interacting partners of proteins). Finally, predictions can be performed for sequences of unknown structure if that of a homologous protein is available. T-PIP’s main contribution is the weighted score assigned to each residue of QP, which takes into account not only the degree of similarity between structural neighbours, but also the nature of their interacting partners.

In addition, we used T-PIP prediction to re-rank docking conformations which resulted in T-PioDock (Template based Protein Interface prediction and protein interface Overlap for Docking model scoring), a complete framework for prediction of a complex 3D structure. T-PioDock supports the identification of near-native conformations from 3D models that docking software produced by scoring those models using binding interfaces predicted by T-PIP.

T-PioDock Pipeline

Exhaustive evaluation of interface predictors on standard benchmark datasets has confirmed the superiority of template base approaches and has showed that the T-PIP methodology performs best. Moreover, comparison between T-PioDock and other state-of-the-art scoring methods has revealed that the proposed approach outperforms all its competitors.

Accurate identification of near-native conformations remains a challenging task. Although availability of 3D complexes will benefit to template based methods such as T-PioDock, we have identified specific limitations which need to be addressed. First, docking software are still not able to produce native like models for every target. Second, current interface predictors do not explicitly refer to pair-wise residue interactions which leaves ambiguity when assessing quality of complex conformations.

# Life in Colour – Vim

Among programmers, there are occasional debates on what editor is best — some love Eclipse, some are die-hard Emacs supporters, or some have no preference, and use the default text editor(s) with their OS. Whatever your choice, you can never underestimate how useful Vim can be, e.g. if you SSH into another machine. And so, here is a vim config that I’ve been using (thanks to Ben Frot), which makes your vim environment very colourful and easy to read. Code available here.

Plus, you can do awesome things in vim:

Edit multiple files in Vim. Can get a little crazy but, hey, why not?

So, to do some of the crazier things (e.g. what I’ve shown in this blog post), try this:

```# Open a file of choice
:e blah1.md

# First split to two screens; change between screens by Ctrl + ww
:split

# Now open a second file
:e blah2.md

# Repeat for more screens & lines.```

Happy vim-ing!

# Conservation of codon optimality

The unidirectional process of translation converts one dimensional mRNA sequences into three dimensional protein structures. The reactant, the mRNA, is built from 64 codon variants, while the product, the protein, is constructed from the 20 biologically relevant amino acids. This reduction of 64 variants to a mere 20 is indicative of the degeneracy contained within the genetic code. Multiple codons encode for the same amino acid, and hence any given protein can be encoded by a multitude of synonymous mRNA sequences.  Current structural biology research often assumes that these synonymous mRNA sequences are equivalent; the ability of certain proteins to unfold and refold without detrimental effects leading to the assumption that the information pertaining to the folded structure is contained exclusively within the protein sequence. In actuality, experimental evidence is mounting that shows synonymous sequences can in fact produce proteins with different properties; synonymous codon switches having been observed to cause a wide range of detrimental effects, such as decreases in enzyme activity, reductions in solubility, and even causing misfolding.

The ribosome (yellow) passes along the mRNA, turning the sequence of codons into a protein. Under our model, the speed of the translation for each codon varies, in turn differing the time available to the nascent peptide chain to explore the fold space. Through this method codon choice becomes an additional source of structural information.

For my 10 week DTC project within the Deane group,  I was tasked with resurrecting the Coding Sequence and Structure database (CSandS; read: Sea Sands) and using it to test for the evolutionary conservation of codon choice over groups of proteins with similar structures. With experimental differences noted in protein product structure between synonymous sequences, we wished to investigate if codon choice is an evolutionary constraint on protein structure, and if so, to what degree, and in what manner. To test for evolutionary conservation we combined the theories of codon optimality and cotranslational folding, our hypothesis being that the choice of codon directly affects the translation efficiency of the ribosome; consequently different codons give the nascent polypeptide emerging from the ribosome varying amounts of time to explore the fold-space.  By assigning optimality scores to each codon, we can search for regions of fast translation and slow translation, then by looking for correlations within aligned sets of structural similar proteins we can identify sites where codon choice is of importance. While the 10 weeks project focussed mainly on methodology and implementation, my preliminary results indicate that a large percentage of proteins are under selective pressures with regards to codon choice. I found that in most sets of proteins, there is an greater amount of correlation than would be expected by chance, this crucially suggests that there is additional structural information contained within the mRNA that is lost once translation occurs.

For additional information on the background, methodology and implementation of this research, please feel free to view the presentation slides at:  http://www.slideshare.net/AlistairMartin/evolut-26981404

# Protein Interaction Networks

Proteins don’t just work in isolation, they form complex cliques and partnerships while some particularly gregarious proteins take multiple partners. It’s becoming increasingly apparent that in order to better understand a system, it’s insufficient to understand its component parts in isolation, especially if the simplest cog in the works end up being part of system like this.

On a macroscopic scale, a cell doesn’t care if the glucose it needs comes from lactose, converted by lactase into galactose and glucose, or from starch converted by amalase, or from glycogen, or from amino acids converted by gluconeogenesis. All it cares about is the glucose. If one of these multiple pathways should become unavailable, as long as the output is the same (glucose) the cell can continue to function. At a lower level, by forming networks of cooperating proteins, these increase a system’s robustness to change. The internal workings may be rewired, but many systems don’t care where their raw materials come from, just so long as they get them.

Whilst sequence similarity and homology modelling can explain the structure and function of an individual protein, its role in the greater scheme of things may still be in question. By modelling interaction networks, higher level questions can be asked such as: ‘What does this newly discovered complex do’? – ‘I don’t know, but yeast’s got something that looks quite like it.’ Homology modelling therefore isn’t just for single proteins.

Scoring the similarity of proteins in two species can be done using many non-exclusive metrics including:

• Sequence Similarity – Is this significantly similar to another protein?
• Gene Ontology – What does it do?
• Interaction Partners – What other proteins does this one hang around with?

• Subsequently clustering these proteins based on their interaction partners, highlights the groups of proteins which form functional units. These are highly connected internally whilst having few edges to adjacent clusters. This can provide insight into previously un-investigated proteins which by virtue of being in a cluster of known purpose, their function can be inferred.

# Antibody Modelling: CDR-H3 Structure Prediction

As regular readers of this blog will know (I know you’re out there somewhere!), one of the main focusses of OPIG at the moment is antibody structure. For the last ten weeks (as one of my short projects for the Systems Approaches to Biomedical Science program of the DTC) I have been working on predicting the structure of the CDR-H3 loop.

So, a quick reminder on antibody structure: antibodies, which have a characteristic shape reminiscent of the letter `Y’, consist of two identical halves, each containing a heavy and a light chain. Heavy chains are made up of four domains (three constant domains, CH1, CH2 and CH3; and one variable domain, VH), while light chains have two (one constant domain, CL; and one variable domain, VL). The variable domains of both the heavy and light chain together are known as the Fv region; most naturally occurring antibodies have two. At the ends of these Fv regions are six loops, known as the complementarity determining regions, or CDRs. There are three CDRs on each of the VH and VL domains; those located on the VL domain are labelled L1, L2 and L3, while those found on the VH domain are labelled H1, H2 and H3. It is these loops that form the most variable parts of the whole antibody structure, and so it is these CDRs that govern the binding properties of the antibody. Of the six CDRs, by far the most variable is the H3 loop, found in the centre of the antigen binding site. A huge range of H3 lengths have been observed, commonly between 3 and 25 residues but occasionally much longer. This creates a much larger structural diversity when compared to the other CDRs, each of which has at most 8 different lengths. It is the H3 loop that is thought to contribute the most to antigen binding properties. Being able to model this loop is therefore an important part of creating an accurate model, suitable for use in therapeutic antibody design.

Predicting the structure of the loop requires three steps: sampling, filtering and ranking. There are two types of loop modelling method, which differ in the way they perform the sampling step: knowledge-based methods, and ab initio methods. Knowledge-based methods, or database methods, rely upon databases of known loop structures that can be searched in order to find fragments that would form feasible structures when placed in the gap. Whilst predictions are made relatively quickly in this way, one disadvantage is that the database of fragments may not contain anything suitable, and in this situation no prediction would be made. Ab initio (or conformational searching) methods, on the other hand, do not rely upon a set of previously known loop structures – loop conformations are generated computationally, normally by sampling dihedral angles from distributions specific to each amino acid. The loops generated in this way, however, are not ‘closed’, i.e. the loop does not attach to both anchor regions, and therefore some sort of loop closure method must be implemented. The assumption is made that the native loop structure should represent the global minimum of the protein’s free energy. Ab initio methods are generally much slower than knowledge-based ones, and their accuracy is dependent on loop length (long loops are harder to predict using this method), however unlike the database methods, an answer will always be produced.

For my project, I have examined the performance of FREAD (a knowledge-based method) and MECHANO (an ab initio method) when predicting the structure of the H3 loop. At the moment, FREAD produces better results than MECHANO, however we hope to improve the predictions made by both. By optimising the performance of both methods, we hope to create a ‘hybrid’ loop modelling method, thereby exploiting the advantages of both approaches. Since I’ve decided that this is the project I want to continue with, this will be the aim of my DPhil!

# Get PDB intermolecular protein contacts and interface residues

Very often in Struc Bio it is necessary to determine the contacts between two molecules. Most of us in the group have written a snippet of code to compute precisely that or they have adapted the Biopython functionality or one of the tools in pdbtools. The piece of code written in Python presented here is a Biopython variety that gives you the intermolecular contacts and it annotates the interface neighborhood. An example of the program output is given in the Figure below:

The complex between an antibody and an antigen is shown on the left without the annotation. On the right it is shown with intermolecular contacts annotated in red (4.5A distance) and the interface neighborhood shown in green (10A away from any contact residue).

Download

You can download it from here. There are three files inside:

1. GetInterfaces.py – the main source/runnable file
2. README.txt – instructions, very similar to this post (quite a lot copy/pasted)
3. 1A2Y.pdb – the PDB file used in the example to practice on.

Requirements:

You need Biopython. (if you are from OPIG or any other Bioinformatics group, most likely it is already installed on your machine). You can download it from here.

How to use it?

As the bare minimum, you need to provide the structure of the pdb(s) and the chains that you want to examine contacts in.

Input options:

• –f1 : first pdb file [Required]
• –f2 : second pdb file (if the contacts are to be calculated in the same molecule, just submit the same pdb in both cases) [Required]
• –c1 : Chains to be used for the first molecule [Required]
• –c2 : Chains to be used for the second molecule [Required]
• –c : contact cutoff for intermolecular contacts (Optional, set to 4.5A if not supplied on input)
• –i : interface neighbor cutoff for intramolecular neighborhood of the contacting interface (Optional, set to 10.0A if not supplied on input). Set this option to zero (0.0) if you only want to get the intermolecular contacts in the interface, without the interface neighborhood.
• –jobid : name for the output folder (Set to out_<random number> if not supplied on input)

An example which you can just copy paste and run when in the same directory as the python script:

`python GetInterfaces.py --f1 1A2Y.pdb --f2 1A2Y.pdb --c1 AB --c2 C --c 4.5 --i 10.0 --jobid example_output`

Above command will calculate the contacts between antibody in 1a2y (chains A and B) and the antigen (chain C). The contact distance was defined as 4.5A and the interface distance was defined as 10A. All the output files are saved in the folder out_example_output.

Output

Output folder is placed in the current directory. If you specify the output folder name (–jobid) it will be saved under the name ‘out_[whateveryoutyped]’, otherwise it will be ‘out_[randomgeneratednumber]’. The program tells you at the end where it saved all the files.

Input options:

• molecule_1.pdb – the first supplied molecule with b-factor fields of contacts set to 100 and interface neighborhood set to 50
• molecule_2.pdb – the second supplied molecule with b-factor fields of contacts set to 100 and interface neighborhood set to 50
• molecule_1.txt – whitespace delimited file specifying the contacts and interface neighborhood in the second molecule in the format: [chain] [residue id] [contact ‘C’ or interface residues ‘I’]
• molecule_2.txt – whitespace delimited file specifying the contacts and interface neighborhood in the second molecule in the format: [chain] [residue id] [contact ‘C’ or interface residues ‘I’]
• molecule_1_constrained.pdb – the first molecule, which is constrained only to the residues in the interface.
• molecule_2_constrained.pdb – the second molecule, which is constrained only to the residues in the interface.
• parameters.txt – the contact distance and neighborhood distance used for the particular run.

# Network Analysis

Why networks?

Individual expression could be thought as a phenomenon regulated mostly by the individual, but in a second stand it is also modified by the interactions with the surroundings.  Can the response of the individual be predicted by the group? (See the following video of an experiment conducted by Asch https://www.youtube.com/watch?v=FnT2FcuZaYI)

Most common type of network analysis

• Basic network summary statistics (description)
• Clustering methods (Extract information)
• Random graphs (Description, inference and to model network topology)
• Learning machine methods (Prediction)

Random Graphs and the topology structure

Depending on the structure of a desired network different random models could be of use, for example, if the goal is to obtain a sparse and not highly connected network then an ER model could be of use (this model randomly assign the edges between nodes)
or if the goal is exactly the opposite (have a very highly connected network) a geometric graph could be of use (this model randomly assign positions in a n-dimensional space and then place edges between nodes closer than a given distance).

Is there already a random model?

According to our recent results we suspect there is no null model yet for PPIs, even though  for some virus PPIs some of the random models seem to be very good models; however this virus PPIs are much smaller (around 300 nodes and up to 500 edges) than the networks of model organisms (usually with more than 2000 nodes and 5000 edges) such as yeast, human, fruit fly and Escherichia coli among others.
We will soon be publishing our article with details about this.