Monthly Archives: March 2013

A javascript function to validate FASTA sequences

I was more than a bit annoyed at not finding this out there on the interwebs, being a strong encourager of googling (or, in Jamie’s case, duck-duck-going) and re-use.

So I proffer my very own fasta validation javascript function.

/**
 * Validates (true/false) a single fasta sequence string
 * @param   fasta    the string containing a putative single fasta sequence
 * @returns boolean  true if string contains single fasta sequence, false
 *                   otherwise
 */
function validateFasta(fasta) {

	if (!fasta) { // check there is something first of all
		return false;
	}

	// immediately remove trailing spaces
	fasta = fasta.trim();

	// split on newlines...
	var lines = fasta.split('\n');

	// check for header
	if (fasta[0] === '>') {
		// remove one line, starting at the first position
		lines.splice(0, 1);
	}

	// join the array back into a single string without newlines and
	// trailing or leading spaces
	fasta = lines.join('').trim();

	if (!fasta) { // is whatever we collected empty? re-check, not efficient
		return false;
	}

	// note that the empty string is caught above
	// allow for Selenocysteine (U)
	return /^[ACDEFGHIKLMNPQRSTUVWY\s]+$/i.test(fasta);
}

Let me know, by comments below, if you spot a no-no.  Please link to this post if you find use for it.

p.s. I have already noticed that this only validates one sequence. This is because the function is taken out of one of our web servers, Memoir, which specifically requires only one sequence. If there is interest in multi-sequence validation I will add it.
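In the meantime, here is one possible sketch of multi-sequence validation, reusing the single-sequence function above (reproduced so the snippet is self-contained). `validateFastaMulti` and the strategy of splitting on `>` at the start of a line are my own assumptions, not code from Memoir.

```javascript
// validateFasta from the post, reproduced so this snippet is self-contained
function validateFasta(fasta) {
    if (!fasta) { return false; }
    fasta = fasta.trim();
    var lines = fasta.split('\n');
    if (fasta[0] === '>') { lines.splice(0, 1); }
    fasta = lines.join('').trim();
    if (!fasta) { return false; }
    return /^[ACDEFGHIKLMNPQRSTUVWY\s]+$/i.test(fasta);
}

// Hypothetical multi-sequence extension: split the input into records
// on '>' at the start of a line and validate each record individually.
function validateFastaMulti(text) {
    if (!text) { return false; }
    text = text.trim();
    if (text[0] !== '>') { return false; } // every record needs a header
    var records = text.split(/^>/m).slice(1); // drop the empty prefix
    return records.every(function (record) {
        return validateFasta('>' + record);
    });
}
```

Note that this accepts headerless sequences only via the single-record function's leniency; if every record should be required to have a header, the `text[0] !== '>'` check above enforces it for the first record and the split guarantees it for the rest.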


Journal Club: Can Linear Programming (LP) be useful to us?

Linear programming (LP) is known as a fast and powerful computational technique. It has been applied to a large range of problems in finance and economics, but it is not very popular among us bioinformaticians, computational biologists, and the like.


Linear Programming is all about finding feasible solutions that satisfy a series of constraints (usually represented by inequalities). Does that sound like a familiar problem to bioinformaticians and computational biologists out there?


This leaves room for some questioning: can biological phenomena be modelled or simplified under the assumption of linearity? Furthermore, can LP be used to tackle the many difficult problems posed in our field? Perhaps an even bigger question: why would any of us use Linear Programming instead of another type of linear modelling? What are the advantages of it?

I will not go into the particulars of LP here. There is a plethora of material available online (Wikipedia and Wolfram are accessible starting points) that details Linear Programming. For those eager for something more substantial, V. Chvatal’s Linear Programming and Dantzig’s Linear Programming and Extensions are two good texts on the subject.

During this week’s journal club, I discussed an article that attempted to use Linear Programming to devise knowledge-based Docking Potentials (DP) tailored for transient protein-protein complexes. Transient complexes tend to be under-represented on the PDB, mostly due to the inherent difficulties of crystallizing such complexes. Hence, the development of knowledge-based potentials for these special cases of protein interaction is drastically hindered by a sample size limitation.

Source: Bizzarri AR, Brunori E, Bonanni B, Cannistraro S. Docking and molecular dynamics simulation of the Azurin–Cytochrome c551 electron transfer complex. J. Mol. Recognit. 2007; 20: 122–131

A cartoon representation of a transient complex between Azurin (cyan) and its partner Cytochrome C551 (dark blue) from Pseudomonas aeruginosa. Transient protein complexes are hard to crystallize, hence, under-represented on the PDB.


To offset such limitation, it would be ideal if one could extrapolate information from decoys (non-native conformations obtained from computational docking tools) in order to improve the Docking potentials. Furthermore, in an ideal world, one would also address the bias introduced by homology/sequence similarity between the existing proteins in the available structures of transient complexes.

The author of the article “Designing coarse grained-and atom based-potentials for protein-protein docking – Tobi D. – BMC Structural Biology 2010, 10:40 doi:10.1186/1472-6807-10-40” claims that LP can address such issues by incorporating information from the decoys as linear constraints in the model. The article describes a linear problem in which the aim is to minimize the variance of how much the non-native energy potentials differ from the native ones. The author also imposes the constraint that, for a given complex, the native structure must have a lower energy than all of the non-native structures (lower being better in this case).
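To make the setup concrete, here is a minimal sketch (my own illustration, not the paper's code) of the two pieces just described: the energy as a weighted sum of interaction-type counts, and one linear constraint row per decoy requiring the native complex to score lower than that decoy.

```javascript
// Energy of a complex: weighted sum of interaction-type counts.
// The weights w are the unknowns the LP solves for.
function energy(weights, counts) {
    return counts.reduce(function (e, c, k) { return e + weights[k] * c; }, 0);
}

// One constraint row per decoy: sum_k (decoy[k] - native[k]) * w[k] >= margin,
// i.e. the native structure must have strictly lower energy than the decoy.
// Building the rows is the illustrative part; solving the LP would need an
// actual solver, which is not shown here.
function buildConstraintRows(nativeCounts, decoyCountsList) {
    return decoyCountsList.map(function (decoyCounts) {
        return decoyCounts.map(function (c, k) { return c - nativeCounts[k]; });
    });
}
```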

The energy is defined as a weighted sum of the counts of specific interaction types at the complex interface. The author employed two models: an atom-based model and a side-chain-based model. These models are used to classify atoms into groups and to simplify calculations. Initially, boolean (one-step) interactions are defined: two atoms interact if they are within a cutoff distance of each other, the cutoff varying according to the types of atoms involved. This initial model led to a state of infeasibility, and it was then replaced by a two-step model with strong and weak interactions and two sets of cutoffs (this leads to twice as many unknowns in the LP model).
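The two-step interaction model can be sketched as follows; the cutoff values in the usage below are placeholders of my own, not the paper's atom-type-dependent ones.

```javascript
// Two-step interaction counting: a pair of atoms is a "strong" interaction
// below cutoff1 and a "weak" one between cutoff1 and cutoff2; beyond
// cutoff2 the pair does not interact. Real cutoffs depend on atom types.
function countInteractions(pairDistances, cutoff1, cutoff2) {
    var counts = { strong: 0, weak: 0 };
    pairDistances.forEach(function (d) {
        if (d <= cutoff1) { counts.strong += 1; }
        else if (d <= cutoff2) { counts.weak += 1; }
    });
    return counts;
}
```

These strong/weak counts are what feed the weighted-sum energy, and having two counts per interaction type is why the number of LP unknowns doubles.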

Well, does it work? How does it fare against other existing knowledge-based DPs?

Source: Designing coarse grained-and atom based-potentials for protein-protein docking – Tobi D. – BMC Structural Biology 2010, 10:40 doi:10.1186/1472-6807-10-40

Despite the lack of brilliant results or any apparent improvement over the state of the art, the potentials described in the article do seem to slightly outperform ZDOCK2.3’s scoring functions.

This may actually speak in favour of the applicability of LP to problems in our area. In the case presented during the journal club, an LP approach produced comparable results to more conventional techniques.

Perhaps the best answer to “why should I use LP?” is that it is an unconventional, creative solution. It is significantly fast and, therefore, easy to try out depending on your problem. Science is all about experimentation, after all. Why would you not try a different technique if you have the means to?


The moral of the story: it is good to think outside the box, as long as you keep your feet on the ground.


Check the article discussed in the post here.

Journal Club: Protein structure model refinement using fragment-guided MD

For this week’s journal club I presented this paper by Jian et al from the Zhang Lab. The paper tackles the problem of refining protein structure models using molecular dynamics (MD).

The most successful protocols for building protein structure models have been template-based methods. These involve selecting whole or parts of known protein structures and assembling them to form an initial model of the target sequence of interest. This initial model can then be altered or refined to (hopefully) achieve higher accuracy. One method to make these adjustments is to use molecular dynamics simulations to sample different conformations of the structure. A “refined” model can then be taken as a low-energy state that the simulation converges to. However, whilst physics-based potentials are effective for certain aspects of refinement (e.g. relieving clashes between side chain atoms), the task of actually improving overall model quality has so far proved to be too ambitious. 

The Method

In this paper entitled “Atomic-Level Protein Structure Refinement Using Fragment-Guided Molecular Dynamics Conformation Sampling,” Jian et al demonstrate that current MD refinement methods have little guarantee of making any improvement to the accuracy of the model. They therefore introduce a technique of supplementing physics-based potentials with knowledge about fragments of structures that are similar to the protein of interest.

The method works by using the initial model to search for similar structures within the PDB. These are found using two regimes. The first is to search for global templates by assessing the TM-score of structures to the whole initial model. The second is to search for fragments of structures by dividing the initial model into continuous fragments of three secondary-structure elements. From these sets of templates and the initial model, the authors can generate a bespoke potential for the model based on the distances between Cα atoms. By doing this, additional information about the likely global topology of the protein can be incorporated into a molecular dynamics simulation. The authors claim that the MD energy landscape is thereby reshaped from being “golf-course-like” to being “funnel-like”. Essentially, the MD simulations are guided to sample conformations which are likely (as informed by the fragments) to be close to the target protein structure.
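As a rough illustration of the kind of term involved (my own sketch, not FG-MD's actual functional form or weights), a fragment-derived restraint can be as simple as a harmonic penalty on Cα pair distances:

```javascript
// Illustrative harmonic restraint on C-alpha pair distances: dModel are
// distances in the current conformation, dTarget the corresponding
// distances derived from the global/fragment templates. FG-MD's actual
// potential is more elaborate than this sketch.
function restraintEnergy(dModel, dTarget, weight) {
    var e = 0;
    for (var i = 0; i < dModel.length; i++) {
        var diff = dModel[i] - dTarget[i];
        e += weight * diff * diff; // penalise deviation from template distance
    }
    return e;
}
```

Added to a physics-based force field, a term like this is what biases the simulation towards the template-informed topology, producing the funnel-like landscape the authors describe.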


A schematic of the FG-MD refinement procedure

 Does it work?

As a full solution to the problem of protein structure model refinement, the results are far from convincing. Quality measures show improvement in only the second or third decimal place from the initial model to the refined model. Also, as might be expected, the degree to which the model quality is improved is dependent on the accuracy of the initial model.

However, what is important about this paper is that, although small, the improvements do exist in a systematic fashion. Previously, attempts to refine a model using MD not only failed to improve its accuracy but were likely to reduce its quality. Fragment-guided MD (FG-MD), with the explicit inclusion of a hydrogen-bonding potential, is not only able to improve the conformations of side chains but can also improve (or at least not destroy) the global backbone topology of a model.

Dependence of the energy of a model on its TM-score to the native structure. In black is the energy as measured using the AMBER99 energy function. In grey is the corresponding funnel-like shape of the FG-MD energy function.


This paper therefore lays the groundwork for the development of further refinement methods that incorporate the knowledge from protein structure fragments into atomically detailed energy functions. Given that the success of the method is related to the accuracy of the initial model, there may be scope for developing similar techniques to refine models of specific proteins where modelling quality is already good, e.g. antibodies.






aRrrrgh! or how to apply a fitted model to new data

Recently I’ve been battling furiously with R while analysing some loop modelling accuracy data. The idea was simple:

  1. Fit a general linear model to some data
  2. Get out a formula to predict a variable (let’s call it “accuracy”) based on some input parameters
  3. Apply this formula to new data and see how well the predictor does

It turns out, it’s not that simple to actually implement. Fitting a general linear model in R produces coefficients in a vector.

model <- glm(accuracy ~ param1 + param2 * param3, data=trainingset)
coef(model)
            (Intercept)                  param1                  param2 
            0.435395087            -0.093295388             0.148154339 
                 param3           param2:param3 
            0.024399530             0.021100300 

There seems to be no easy way to insert these coefficients into your formula and apply the resulting equation to new data. The only easy thing to do is to plot the fitted values against the variable we’re trying to predict, i.e. plot our predictions on the training set itself:

plot(model$fitted.values, trainingset$accuracy, xlab="score", ylab="accuracy", main="training set")

I’m sure there must be a better way of doing this, but many hours of Googling led me nowhere. So here is how I did it. I ended up writing my own parser function, which works only on very simple formulae using the + and * operators and without any R code inside the formula.

coefapply <- function(coefficients, row) {
  result <- 0
  for (i in 1:length(coefficients)) {
    subresult <- as.numeric(coefficients[i])
    if (!is.na(subresult)) {
      name <- names(coefficients[i])
      if (name != "(Intercept)") {
        subnames <- strsplit(name, ":", fixed=TRUE)[[1]]
        for (n in subnames) {
          subresult <- subresult * as.numeric(row[n])
        }
      }
      result <- result + subresult
    }
  }
  result
}

calculate_scores <- function(data, coefficients) {
  scores <- vector(mode="numeric", length=nrow(data))
  for (i in 1:nrow(data)) {
    row <- data[i,]
    scores[i] <- coefapply(coefficients, row)
  }
  scores
}

Now we can apply our formula to a new dataset and plot the accuracy achieved on the new data:

model_coef <- coef(model)

# Test if our scores are the same values as the model's fitted values
training_scores <- calculate_scores(trainingset, model_coef)
sum((training_scores - model$fitted.values) < 0.000000000001) / length(training_scores)

# Calculate scores for our test set and plot them
test_scores <- calculate_scores(testset, model_coef)
plot(test_scores, testset$accuracy, xlab="score", ylab="accuracy", main="test set")

It works for my purpose. Maybe one day someone will see this post, chuckle, and then enlighten me with their perfectly simple and elegant alternative.

On being cool: arrows on an R plot

Recently I needed a schematic graph of traditional vs on-demand computing (don’t ask) – and in this hand-waving setting I just wanted the axes to show arrows and no labels. So, here it is:

x <- c(1:5)
y <- rnorm(5)
plot(x, y, axes = FALSE)
u <- par("usr") 
arrows(u[1], u[3], u[2], u[3], code = 2, xpd = TRUE) 
arrows(u[1], u[3], u[1], u[4], code = 2, xpd = TRUE)

And here is the output

Arrowed plot

(I pinched this off a mailing list post, so this is my due reference)

The next thing I am toying with is these xkcd-like graphs in R here.

Journal club: Improving the selectivity of antimicrobial peptides

Last Wednesday I talked about this paper, which is about antimicrobial peptides. First I should say that even though many of the authors are French, this particular group was unknown to me and it was only the subject tackled which caught my eye – no patriotism, as should become obvious in the following paragraphs. More precisely, being a bit of an outlier in OPIG (as I spend all my time in the biochemistry building doing Molecular Dynamics (MD) simulations), I was curious to get feedback from OPIG members on the sequence processing method developed in this article.

So what are antimicrobial peptides (AMP) and what do we want to do with them?

These are peptides able to kill bacteria in a rather unspecific manner: instead of targeting a particular membrane receptor, AMP appear to lyse bacterial membranes via poration mechanisms which are not totally understood but rely on electrostatic interactions and lead to cell leakage and death. This mode of action means it is difficult for bacteria to develop resistance (they can’t change their entire membrane), and AMP are extensively studied as promising antibiotics. However, the therapeutic potential of AMP is severely limited by the fact that they are also often toxic to eukaryotic cells – this is the downside of their rather unspecific membrane-lytic activity.

If AMP are to become useful antibiotics it is therefore important to maximise their activity against bacterial cells while limiting it against eukaryotic cells. This is precisely the problem tackled by this paper.

The ratio between the desired and undesired activity of AMP can be measured by the therapeutic index (TI), which is defined in this paper as:

TI = HC50 / MIC

where HC50 is the AMP concentration needed to kill 50% of red blood cells and MIC is the minimal inhibitory concentration against bacteria. The higher the TI, the better the AMP for therapeutic use, and the authors present a method to identify mutations (single or double) in order to increase the TI of a given sequence.

What did the authors do?

Their method relies on a new way to describe AMP properties, which the authors call the sequence moment and which was built to reflect the asymmetry along the peptide chain in the property of interest. The rationale underlying this is that there is a functional asymmetry in the peptide chain, with the N-terminal region being “more important for activity and selectivity of AMPs” – this claim is based on a reference from 2000 in which it is shown that, in the peptides studied, mutations in the N-terminus region are more likely than those in the C-terminus region to result in a decrease in AMP activity.

The derivation of the sequence moment is interesting and illustrated in the figure below. The AMP sequence is projected onto a 90-degree arc, with the N terminus on the y axis. The metric of interest is then plotted for each residue as a vector (tiny arrows in the figure) with the same origin as the arc, whose length and direction depend on the metric value and the orientation. The mean value of the metric for the AMP sequence is then obtained by summing the positional vectors together (big arrows).

With this method the authors are able to get a mean value for the metric of interest which also contains information on the distribution along the AMP sequence of the metric of interest.


Interestingly, the descriptor used by the authors is actually the angle difference between the arrows obtained when calculating the sequence moment using two different hydrophobicity scales (Janin’s and Guy’s) – see red and blue arrows in the figure above.

The authors claim that the cosine of this angle, termed descriptor D, correlates well with the TI of AMP according to the following linear relation:

TI = 50.1 – 44.8*D
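My reading of the construction, sketched in code; the arc placement and the normalisation are my own assumptions from the figure, not the authors' implementation.

```javascript
// Sequence moment: residue i of an n-residue peptide is placed on a
// 90-degree arc (N terminus on the y axis); its hydrophobicity h[i]
// sets the vector length, and the per-residue vectors are summed.
function sequenceMoment(h) {
    var n = h.length, x = 0, y = 0;
    for (var i = 0; i < n; i++) {
        // angle from 90 degrees (N terminus) down to 0 degrees (C terminus)
        var theta = (Math.PI / 2) * (n === 1 ? 1 : 1 - i / (n - 1));
        x += h[i] * Math.cos(theta);
        y += h[i] * Math.sin(theta);
    }
    return [x, y];
}

// Descriptor D: cosine of the angle between the moments computed with
// two different hydrophobicity scales (Janin's and Guy's in the paper),
// from which the paper estimates TI = 50.1 - 44.8 * D.
function descriptorD(hScale1, hScale2) {
    var a = sequenceMoment(hScale1), b = sequenceMoment(hScale2);
    var dot = a[0] * b[0] + a[1] * b[1];
    return dot / (Math.hypot(a[0], a[1]) * Math.hypot(b[0], b[1]));
}
```

With this reading, two scales that are proportional for a given sequence give parallel moments and D = 1, while scales that disagree strongly along the chain rotate the moments apart and push D towards 0.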

Long story short, the authors then describe how they implemented a piece of software, called Mutator and available here, which takes a sequence as input and, based on a set of the 26 best AMP (defined as having a TI > 20 and less than 70% sequence homology between them), suggests single or double mutations to improve its TI based on the method and relationship above. They then test their predictions experimentally for two peptides (Ascaphin-8 and XT-7), and it appears that the analogues suggested by the Mutator do have a much higher TI than their parent sequence.

What do we think about it?

Although this paper presents a method which seemingly gives impressive results, I have two major problems with it.

The first is that the D descriptor has no clear rationale or signification, which makes interpreting the results difficult. Which sequence property does this descriptor capture? At best it can be said that when D is small the two hydrophobicity scales differ widely for the sequence studied (arrows at a 90-degree angle), whereas they agree when D is close to 1. It would then be necessary to go back to how the scales were derived to understand what is being picked up here, if anything.

Second, there is no proper statistical analysis of the significance of the results obtained. As noted by OPIG members, the peptides studied are all fairly similar, and the less-than-70% pairwise identity rule used for the training does not guarantee much diversity. Essentially the algorithm is thus turning sequences which are not very different from the training set into even more similar sequences, and the types of residues mutated to achieve this are limited by the fact that AMP are rather short sequences enriched in particular residues. Therefore one might argue that any such mutation is rather likely to lead to an improvement of the TI – especially if the input sequence is chosen specifically for not being optimised for the desired activity (broad-spectrum AMP). It would be important to design a proper null-hypothesis control and measure experimentally whether the TI of the analogues it produces is statistically significantly lower than that of the analogues obtained with the Mutator software.

In summary, it might sound a bit harsh, but my personal opinion is that this is the kind of paper that people who run Virtual Screening like a black box (see JP’s excellent previous post) will enjoy. Copy and paste a sequence, hit run, and it seemingly gives you a result. If you look hard enough for “a” descriptor that “correlates” with your data you will find one – especially if A) you don’t define what “to correlate” means, and B) your descriptor doesn’t have to mean anything. So the trouble is that no one has the beginning of a clue about what is going on and, what is worse, before being able to think about it, it would first be necessary to run proper controls in order to assess whether anything is actually going on.

A free, sweet, valid HTML4 “Site Maintenance” page

So today we moved servers from the cloud to a physically local server and we needed a “Site Maintenance” page. A few google searches turned up a simple HTML5 template, which I converted to HTML4; it is reproduced hereunder (I could not find the original source, aargh):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <meta http-equiv="Content-type" content="text/html;charset=UTF-8">
    <title>Site Maintenance</title>
    <style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }
    </style>
  </head>
  <body>
    <div id="article">
      <h1>We&rsquo;ll be back soon!</h1>
      <p>Sorry for the inconvenience but we&rsquo;re performing some maintenance at the moment. If you need to you can always contact us on <b>opig AT</b>, otherwise we&rsquo;ll be back online shortly!  Site should be back up on Friday 1st March 2013, 16:00 GMT.</p>
      <p>&mdash; OPIG</p>
    </div>
  </body>
</html>

And here is what it looks like… (nothing glamorous, you have been warned)