Last Wednesday I talked about this paper, which is about antimicrobial peptides. First I should say that, even though many of the authors are French, this particular group was unknown to me and it was only the subject tackled which caught my eye – no patriotism involved, as should become obvious in the following paragraphs. More precisely, being a bit of an outlier in OPIG (I spend all my time in the biochemistry building doing Molecular Dynamics (MD) simulations), I was curious to get feedback from OPIG members on the sequence processing method developed in this article.
So what are antimicrobial peptides (AMP) and what do we want to do with them?
These are peptides able to kill bacteria in a rather unspecific manner: instead of targeting a particular membrane receptor, AMP appear to lyse the bacterial membrane via poration mechanisms which are not totally understood but rely on electrostatic interactions and lead to cell leakage and death. This mode of action means it is difficult for bacteria to develop resistance (they can’t change their entire membrane) and AMP are extensively studied as promising antibiotics. However, the therapeutic potential of AMP is severely limited by the fact that they are also often toxic to eukaryotic cells – this is the downside of their rather unspecific membrane lytic activity.
If AMP are to become useful antibiotics it is therefore important to maximise their activity against bacterial cells while limiting it against eukaryotic cells. This is precisely the problem tackled by this paper.
The balance between the desired and undesired activities of an AMP can be measured by the therapeutic index (TI), which is defined in this paper as:
TI = HC50 / MIC
where HC50 is the AMP concentration needed to kill 50% of red blood cells and MIC is the minimal inhibitory concentration against bacteria. The higher the TI, the better the AMP for therapeutic use, and the authors present a method to identify mutations (single or double) that increase the TI of a given sequence.
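To make the definition concrete, here is a tiny Python sketch of the ratio. The function name and the concentrations are made up for illustration (they are not values from the paper); the only thing taken from the text is TI = HC50 / MIC, with both concentrations expressed in the same unit.

```python
def therapeutic_index(hc50: float, mic: float) -> float:
    """Return TI = HC50 / MIC (both concentrations in the same unit)."""
    if hc50 <= 0 or mic <= 0:
        raise ValueError("concentrations must be positive")
    return hc50 / mic


# Invented numbers, purely illustrative: same antibacterial potency (MIC),
# but the second peptide needs a 4x higher concentration to lyse red blood cells.
parent_ti = therapeutic_index(hc50=50.0, mic=10.0)
analogue_ti = therapeutic_index(hc50=200.0, mic=10.0)
print(parent_ti, analogue_ti)
```

In other words, the TI can be raised either by lowering the MIC or by raising the HC50, and the number alone does not tell you which of the two happened.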
What did the authors do?
Their method relies on a new way to describe AMP properties, which the authors called the sequence moment and which was built to reflect the asymmetry along the peptide chain regarding the property of interest. The rationale underlying this is that there is a functional asymmetry in the peptide chain, with the N terminal region being “more important for activity and selectivity of AMPs” – this claim is based on a reference from 2000 in which it is shown that, in the peptides studied, mutations in the N-terminus region are more likely than those in the C-terminus region to result in a decrease in AMP activity.
The derivation of the sequence moment is interesting and illustrated in the figure below. The AMP sequence is projected on a 90-degree arc, with the N terminus on the y axis. The metric of interest is then plotted for each residue as a vector (tiny arrows in the figure) with the same origin as the arc, the length and direction of which depend on the metric value and on the position of the residue along the arc. The mean value of the metric for the AMP sequence is then obtained by summing the positional vectors together (big arrows).
With this method the authors get a mean value for the metric of interest which also carries information on how that metric is distributed along the AMP sequence.
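As a rough illustration of the idea, here is a minimal Python sketch of such a vector sum. Everything about the details is my assumption rather than the authors' exact definition: I spread the residues evenly over the arc, use the raw metric value as the vector length, and apply no smoothing or normalisation. The function names and the toy profiles are hypothetical too.

```python
import math


def sequence_moment(values):
    """Sum per-residue vectors spread evenly over a 90-degree arc.

    Residue 0 (N terminus) points along +y and the last residue along +x.
    Each vector has the residue's metric value as its signed length.
    Returns the summed vector as (x, y).
    """
    n = len(values)
    if n == 0:
        raise ValueError("empty sequence")
    x = y = 0.0
    for i, value in enumerate(values):
        # Angle measured from the y axis: 0 at the N terminus, pi/2 at the C terminus.
        theta = 0.0 if n == 1 else (math.pi / 2) * i / (n - 1)
        x += value * math.sin(theta)
        y += value * math.cos(theta)
    return x, y


def angle_from_y_axis(moment):
    """Direction of a moment vector, in degrees from the y (N-terminal) axis."""
    x, y = moment
    return math.degrees(math.atan2(x, y))


# Two toy profiles with the same plain average (1.0) but opposite distributions.
n_heavy = [2.0, 1.0, 0.0]  # metric concentrated at the N terminus
c_heavy = [0.0, 1.0, 2.0]  # metric concentrated at the C terminus
print(angle_from_y_axis(sequence_moment(n_heavy)))
print(angle_from_y_axis(sequence_moment(c_heavy)))
```

The two toy profiles have identical plain averages, yet their summed vectors point in different directions (closer to the y axis for the N-terminal-heavy one), which is exactly the positional information an ordinary mean throws away.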
Interestingly, the descriptor used by the authors is actually the angle difference between the arrows obtained when calculating the sequence moment using two different hydrophobicity scales (Janin’s and Guy’s) – see red and blue arrows in the figure above.
The authors claim that the cosine of this angle, termed descriptor D, correlates well with the TI of AMP according to the following linear relation:
TI = 50.1 – 44.8*D
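For reference, here is a small Python sketch of how D and the predicted TI could be computed from two sequence-moment vectors. The linear relation is the one quoted above; the function names and the example vectors are hypothetical, and the vectors are not derived from the real Janin or Guy scales.

```python
import math


def descriptor_d(moment_a, moment_b):
    """Cosine of the angle between two 2D sequence-moment vectors."""
    ax, ay = moment_a
    bx, by = moment_b
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0.0:
        raise ValueError("a zero-length moment has no direction")
    return (ax * bx + ay * by) / norm


def predicted_ti(d):
    """Linear relation quoted from the paper: TI = 50.1 - 44.8 * D."""
    return 50.1 - 44.8 * d


# Made-up moment vectors, not computed from any real hydrophobicity scale.
same_direction = descriptor_d((1.0, 2.0), (2.0, 4.0))  # arrows aligned
right_angle = descriptor_d((0.0, 3.0), (5.0, 0.0))     # arrows at 90 degrees
print(predicted_ti(same_direction), predicted_ti(right_angle))
```

Note what the relation implies: for D between 0 and 1 the predicted TI can only range from about 5.3 (arrows aligned) to 50.1 (arrows at a right angle).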
Long story short, the authors then describe how they implemented a piece of software, called Mutator and available here, which takes a sequence as input and, based on a set of 26 best AMP (defined as having a TI > 20 and less than 70% sequence homology between them), suggests single or double mutations to improve its TI using the method and relationship above. They then test their predictions experimentally for 2 peptides (Ascaphin-8 and XT-7) and it appears that the analogues suggested by the Mutator do have a much higher TI than their parent sequences.
What do we think about it?
Although this paper presents a method which seemingly gives impressive results I have 2 major problems with it.
The first one is that the D descriptor has neither a rationale nor a clear meaning, which makes interpreting the results difficult. Which sequence property does this descriptor capture? At best it can be said that when D is small the two hydrophobicity scales differ widely for the sequence studied (arrows at a 90-degree angle), whereas they agree when D is close to 1. It would then be necessary to go back to how the scales were derived to understand what is being picked up here, if anything.
Second, there is no proper statistical analysis of the significance of the results obtained. As noted by OPIG members, the peptides studied are all fairly similar, and the less than 70% pairwise identity rule used for the training set does not guarantee much diversity. Essentially the algorithm is thus taking sequences which are not very different from the training set and making them even more similar to it, and the type of residues mutated to achieve this is limited because AMPs are rather short sequences enriched in particular residues. Therefore one might be able to argue that any such mutation is rather likely to lead to an improvement of the TI – especially if the input sequence is chosen specifically for not being optimised for the desired activity (broad spectrum AMP). It would be important to design a proper null hypothesis control and measure experimentally whether the TI of the control analogues obtained that way is statistically significantly lower than the TI of those obtained with the Mutator software.
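To give an idea of what the analysis side of such a control could look like, here is a minimal Python sketch of a one-sided permutation test comparing the TI of software-designed analogues with the TI of control analogues (for instance random mutants, which is my suggestion and not something from the paper). The function name is hypothetical and every TI value below is invented purely to exercise the code; none of them are measurements.

```python
import random


def permutation_p_value(designed, control, n_perm=10_000, seed=0):
    """One-sided permutation test on the difference of mean TI.

    Null hypothesis: the 'designed' and 'control' labels are exchangeable,
    i.e. the design method does no better than the control procedure.
    Returns the estimated probability of seeing a difference in means at
    least as large as the observed one under random relabelling.
    """
    rng = random.Random(seed)
    k = len(designed)
    pooled = list(designed) + list(control)
    observed = sum(designed) / k - sum(control) / len(control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Invented TI values, NOT experimental data.
designed_ti = [40.0, 45.0, 50.0, 55.0, 60.0]
control_ti = [5.0, 6.0, 7.0, 8.0, 9.0]
print(permutation_p_value(designed_ti, control_ti))
```

The hard part is of course not the test but the wet lab: the control analogues have to be synthesised and assayed under the same conditions as the designed ones, and two parent peptides give very few data points to work with.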
In summary, it might sound a bit harsh, but my personal opinion is that this is the kind of paper that people who run Virtual Screening like a black box (see JP’s excellent previous post) will enjoy. Copy and paste a sequence, hit run and it seemingly gives you a result. If you look hard enough for “a” descriptor that “correlates” with your data you will find one – especially if A) you don’t define what “to correlate” means, and B) your descriptor doesn’t have to mean anything. So the trouble is that no one has the beginning of a clue about what is going on and, what is worse, before even thinking about it one would first need to run proper controls in order to assess whether anything is actually going on.
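The point about descriptor hunting is easy to demonstrate with a toy simulation. The Python sketch below is entirely my own construction: 26 random "TI" values (matching only the size of the training set mentioned above) are compared against 1,000 descriptors made of pure noise, and the best absolute Pearson correlation is kept. No real peptide data is involved and the function names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_random_correlation(n_peptides=26, n_descriptors=1000, seed=0):
    """Largest |r| found between random 'TI' values and random descriptors."""
    rng = random.Random(seed)
    ti = [rng.random() for _ in range(n_peptides)]
    best = 0.0
    for _ in range(n_descriptors):
        descriptor = [rng.random() for _ in range(n_peptides)]
        best = max(best, abs(pearson_r(descriptor, ti)))
    return best


print(best_random_correlation())
```

With only 26 data points, the best of a thousand meaningless descriptors should comfortably reach an absolute correlation above 0.4, which is why a correlation reported without a significance test or a held-out set does not tell us much.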