# Recognizing pitfalls in Virtual Screening: A critical review

So, my turn to present at Group Meeting Wednesdays – and I decided to go for the work by Scior et al. about pitfalls in virtual screening.  As a general comment, I think this paper is well written and tackles most of the practical problems when running a virtual screening (VS) exercise.  Anyone, who intends to either develop a method in this field or else is planning to run a virtual screening  exercise should read it.  I’ve often heard the phrase “virtual screening doesn’t work”, and that comes almost exclusively from people who run computational experiments as a black box, without understanding what is going on and by accepting all defaults for a specific protocol.  This paper highlights what to watch out for.  Of the author list, I’ve only met with Andreas Bender once at an MGMS meeting a few years back – his main PhD work was on molecular similarity.

The article describes pitfalls associated with four areas; expectations and assumptions; data design and content; choice of software; and conformational sampling as well as ligand and target flexibility.  The authors start off by arguing that the expectations are too high; people just run a VS experiment and expect to find a potent drug.  But this a rare occurrence indeed.  Below is a set of notes for their main points.

Erroneous assumptions and expectations

1. High expectations: main goal is to identify novel bioactive chemical matter for the particular target of interest.  Highly potent compounds desirable but not required.  Expectations too high.  Lead; single digit µM and Hit; < 25µM
2. Stringency of queries: strict vs loose search criteria.  Strict; no diversity, few good results returned.  Loose; returns many false positives.   Removal of one feature at a time – a kind of pharmacophoric feature bootstrapping which highlights which features are important.
3. Difficulty in binding pose prediction.  Taken from their reference [46], “For big (> 350 Daltons) ligands, however, there is currently very little evidence. We hope investigators will come forward with crystallographic confirmation of docking predictions for higher molecular weight compounds to shed more light on this important problem.”  This point is really interesting and tools, such as Gold, even have an FAQ entry which addresses this question.
4. Water: hydrogen bonds are mediated by water which are often visible in the crystal structure.  Hard to predict exact number, position and orientation.  Realism of model at a cost of computational resources.
5. Single vs. multiple/allosteric binding pockets: Sometimes binding site is not known yet we always assume that ligand binds to one specific place.
6. Subjectivity of Post VS compound:  The result of a VS experiment is a ranking list of the whole screening database.  Taking top N results in very similar compounds – so some sort of post processing is usually carried out (e.g. clustering, but even a subjective manual filtering).  Difficult to reproduce across studies.
7. Prospective validation: the benchmarking of VS algorithms is done retrospectively.  Test on an active/decoy set for a particular target.  Only putative inactives.  Rarely validated in an external prospective context.
8. Drug-likeness: most VS experiments based on Lipinski Ro5 – not more than 5 hydrogen bond donors (nitrogen or oxygen atoms with one or more hydrogen atoms), not more than 10 hydrogen bond acceptors (nitrogen or oxygen atoms), a molecular mass less than 500 daltons, An octanol-water partition coefficient log P not greater than 5.  But these apply to oral bioavailability.  Lots of drugs fall out of this scope; intravenous drugs, antibiotic, peptidic drugs.  VS validated on Lipinski space molecules.
9. Diversity of benchmark library vs diversity of future , prospective vs runs: Library must fit the purpose of the experiment.  Most VS validation experiments are on commercially available libraries ~ small fraction of chemical space.  Type of screening library must be closely related to objective of VS campaign.   Results have to be transferable between runs. Validation on specific target family? If the goal is lead optimization combinatorial libraries are attractive.  Natural versus synthesized compounds; different chemical space.

Data design and content

1. Incomparability of benchmark sets: some datasets for docking studies, others for ligand based VS – incomparable methods.  In general 2D methods better than 3D (surprising!).  In 2D methods fingerprints outperform 3D methods.  Same datasets for validation of different methods – hard to reproduce any study otherwise.
2. Limited comparability of performance metrics: Tag along on previous point; performance measurement used for different measurements should be the same. Mean EF risky because of ratio between actives to inactive molecules. ROC curves a problem because of early and late performance – use of BedROC (different importance to early and late stages of retrieved list of compounds). EF = (number of actives / number of expected) for a given % of the database
3. Hit rate in benchmark data sets; small libraries not good enough. Typical VS hit rates ~0.01% – 0.14%. Analogue bias; actives all look very similar to each other. Artificial enrichment; easy to tell between actives and decoys. Recent study found that for ligand based VS using no. of atoms gives half the VS performance.
4. Assay Comparability and Technology: Properly designed datasets such as MUV, use of similarity distributions to remove anything very similar to each other.  Remove problematic molecules like autofluoroscence .  MUV uses data from pubchem; different bioassays from different groups hence different quality.  Choices of targets; cutoffs; parameters; etc.  “Ideal VS benchmark deck will never happen.”
5. Bad molecules as actives: No real activity but either reactive or aggregating molecules in the assay which gives up a false positive; PAINs Pan assay interfering substances or frequent hitters. Small number of actives compared to inactives, false positives worse than false negatives.
6. Putative inactive compounds as decoys. The decoys are actually actives.
7. Feature weights: LBVS based on a single query fails to identify important parts of the the molecule, e.g. benzamidine warhead in factor Xa inhibitors

Choice of Software

1. Interconverting chemical formats; errors or format incompatibilities.  Information lost or altered; or when using same format across different software (e.g. chirality, hybridization, and protonation states).
2. Molecule preparation; query molecules must be preprocessed exactly the same way as the structures in the database being screened to ensure consistency (e.g. partial charge calculation)
3. Feature definition: Specific rules which are sometimes left out of pharmacophoric definition. e.g. O, N in oxazole do not both behave as a HBA. Watch out for tautomers, protonation state, and chirality
4. Fingerprint selection and algorithmic implementation: different implementations of same fingerprint MACCS result in different fingerprints. Choice of descriptors; which ones to pick? Neighbourhood? Substructure?
5. Partial charges: Mesomeric effects; formal +1 charge spread over guanidine structure.
6. Single predictors versus ensembles: no single method works best in all cases. Consensus study; apply multiple methods and combine results.

Conformational sampling as well as ligand and target flexibility

1. Conformational coverage: four main parameters: (i) sampling algorithms and their specific parameters; (ii) strain energy cutoffs (iii) maximum number of conformations per molecule (iv) clustering to remove duplicates
2. Defining bioactive conformations: most ligands have never been co-crystallized with their primary targets and even fewer have been cocrystallized with counter targets. Same ligand might bind to different proteins in vastly different conformations. How easy is it to reproduce the cognate conformation? Also ligand changes shape upon binding. Minimum energy conformations are a common surrogate.
3. Comparing conformations: definitions of identity thresholds. 0.1 < rmsd < 0.5 excellent; 0.5 < rmsd < 1.0 good fit; 1.0 < rmsd < 1.5 acceptable; 1.5 < rmsd < 2.0 less acceptable; >2.0 not a fit in terms of biological terms. All atoms vs fragments RMSD makes direct comparison hard.
4. Size of conformational ensemble; trade off between computational cost and sampling breadth. Conformer generator may not generate bioactive conformation. How many conformations required to have bioactive one. Many bioactive conformations might exist.
5. Ligand flexibility – hard upper limit for no. of conformations. Conformer sizes depend mostly on number of rotatable bonds. Conformer generation tools don’t work well on some classes of molecules e.g. macrocycles
6. High energy conformations – high energy conformers (or physically unrealistic molecules; e.g. a cis secondary amide) detrimental to VS experiments. 3D pharmacophore searches sometimes result in matching strained structure; but 70% of ligands bind at strain energies below 3kcal/mol (stringent).
7. Target flexibility – target flexibility – can do simple things like sidechain rotation, but nothing major like backbone flexibility. Sometimes docking to multiple structures snapshots resulting from molecular dynamics
8. Assumption of ligand overlap – lots of 3D shape based VS attempt to maximize the overlap between ligands – but based on X-ray structures this is not always the case (different ligands may occupy slight different regions of the binding pocket).
9. Missing positive controls – Strict cutoff stops you from retrieving positive controls in your Virtual Screening experiment. Selectivity (lower number of false postives)/ sensitivity (larger percentage of true positives) cutoff needs to be determined appropriately.

In conclusion, VS can be run by a monkey – but if that is the case expect bad results. Careful database preparation, judicious parameter choices, use of positive controls, and sensible compromises between the different goals one attempts to obtain are required. VS probabilistic game – careful planning and attention to detail increases probability of success.

# How to install RDKit on Ubuntu 12.04 / 12.10 / 13.04 / 13.10 / 14.04 / 14.10 (with InChI support)

I make extensive use of this brilliant piece of cheminformatics software (RD)kit, and it has saved me writing my own Molecule, Atom, Bond, Conformer, Fingerprint, SimilarityMetric, Descriptor etc. classes time and time again.  It is really neat, and works with C++ and python (and Java I think).  Here are the instructions on how to install it on a relatively recent Ubuntu version (12.04 / 12.10 / 13.04 / 13.10 / 14.04 / 14.10).

Pre-requisite software (this is why I love Debian based systems)

sudo apt-get install flex bison build-essential python-numpy cmake python-dev sqlite3 libsqlite3-dev libboost-dev  libboost-python-dev libboost-regex-dev

Get the latest RDKit goodie from here (watch out – version number has been replaced by ‘X’ below)

wget http://downloads.sourceforge.net/project/rdkit/rdkit/QX_20XX/RDKit_20XX_XX_X.tgz

Unzip the beast, save it to /opt

sudo tar xzvf RDKit_20XX_XX_X.tgz -C /opt

Add some environment salt, vim ~/.bashrc

export RDBASE=/opt/RDKit_20XX_XX_X
export LD_LIBRARY_PATH=$RDBASE/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$RDBASE:$PYTHONPATH

. ~/.bashrc

if you want the InChI stuff (trust me you do), first:

cd $RDBASE/External/INCHI-API/ ./download-inchi.sh Build (compile), install & test cd$RDBASE
mkdir build
cd build
cmake .. # if you do not care for InChI support OR
cmake -DRDK_BUILD_INCHI_SUPPORT=ON .. # to install InChI generation code
make # -j 4 to use multiple processors
make install
ctest

If all your tests passed successful you are good to go.  Otherwise, get in touch via the post comments below.

# Journal Club: A mathematical framework for structure comparison

For my turn at journal club I decided to look at this paper by Liu et. al. from FSU.

The paper, efficiently titled ‘A mathematical framework for protein structure comparison’, is motivated by the famously challenging problem of structure comparison. Historically, and indeed presently, the most respected methods of structural classification (dividing up the protein universe by structural similarity) have required a significant amount of manual comparison. This is essentially because we have no formal consensus on what similarity between protein structures truly means and how it should be measured. There are computational challenges to efficient alignment as well but, without a formal framework and a well-defined metric, structure comparison remains an ill-posed problem.

The solution for the authors was to represent each protein structure as a continuous, elastic curve and define the deformation between any two curves as the distance between their structures. We consider, hypothetically at first, a geometric space where each point represents a 3D curve (or potential protein structure). One problem with studying protein structures as geometric objects in Euclidean 3D space is that we really want to ignore certain shape-preserving transformations, such as translations, rotations, scaling and re-parametrization. Ideally therefore, we’d like our geometric space to represent curves unique up to a combination of these transformations. That is, if a protein structure is picked up, turned upside down, then put down somewhere else, it should still be treated as the same shape as it was before. Let’s call this space the shape space S.

Key to the formulation of this space is the representation of a protein structure by its square root velocity function (SRVF). We can represent a protein structure by a continuous function β, which maps the unit interval onto 3D space: β: [0,1] → R3. The SRVF of this curve is then defined as:

where β'(t) is the derivative of β. q(t) contains both the speed and the direction of β but, since it is the derivative, it is invariant to any linear translation of β. So, simply by using the SRVF representation of a curve, we have eliminated one of the shape-preserving transformations. We can eliminate another, rescaling, by requiring β(t) to have unit length: ∫|β'(t)|2dt = ∫|q(t)|2dt = 1.

The space of q‘s is not the shape space S as we still have rotations and re-parametrizations to account for but it is a well-defined space (a unit sphere in the Hilbert space L2 if you were wondering). We call this space the preshape space C. The main advantage of the SRVF representation of curves is that the standard metric on this space explicitly measures the amount of deformation, in terms of stretching and bending, between two curves.

All that is left to do is to eliminate the effects of an arbitrary rotation and re-parametrization. In fact, instead of working directly in S, we choose to remain in the preshape space C as its geometry is so well-defined (well actually, it’s even easier on the tangent space of C but that’s a story for another time). To compare two curves in C we fix the first curve, then find the optimal rotation and parametrization of the second curve so as to minimise the distance between them. Computationally, this is the most expensive step (although almost every other global alignment will need this step and more) and consists of firstly applying singular value decomposition to find the rotation then a dynamic programming algorithm to find the parametrization (this is the matching function between the two backbone structures… you could add secondary structure/sequence constraints here to improve the relevance of the alignment).

Now we can compare q1 and q2 in C. Not only can we calculate the distance between them in terms of the deformation from one to the other, we can also explicitly visualise this optimal deformation as a series of curves between q1 and q2, calculated as the geodesic path between the curves in C:

where θ = cos-1( ∫ <q1,q2>dt) is the distance between q1 and q2 in S.

The geodesic path between protein 1MP6, the left-most structure, and protein 2K98, the right-most structure.

As a consequence of this formal framework for analysing protein structures it’s now possible, in a well-defined way, to treat structures as random variables. Practically, this means we can study populations of protein structures by calculating their mean structure and covariances.

# How to make a custom latex bibliography style

Imagine you are writing up your latest thrilling piece of science in your favourite odt or docx format. Nothing comes from nothing so you need to cite the 50 or so people whose ideas you built on, or who came to conclusions that contradict yours. Suddenly you realize that your second sentence needs a reference… and this will require you to renumber the subsequent 50. What a drag! There goes 5 minutes of your life that could have been better spent drinking beer.

Had you written your research in latex instead, this drudgery would have been replaced by a range of much more interesting and intractable difficulties. In latex it is easy to automagically renumber everything just by recompiling the document. Unfortunately, it is often hard to direct latex’s magic… just try moving a picture an inch to the right, or reformatting a reference.

Moving figures around is still a black art as far as I’m concerned… but I’ve recently found out an easy way to reformat references. This might be especially handy when you find out that your sort of proteins fall out of the scope of the International Journal of Eating Disorders and you now want to submit to a journal that requires you to list authors in small-caps, and the dates of publication in seconds from the Unix epoch.

A good way of including references in latex is with a “.bib” file and a “.bst” file. An example of the end of a document is shown below.

 \bibliographystyle{myfile} \bibliography{mycollection}

\end{document} 

What’s happening here? All my references are stored in bibtex format in a database file called “mycollection.bib”. A separate file “myfile.bst” says how the information in the database should be presented. For example, are references in the text of the form (Blogs et al 2005) or are they numbered (1)? At the end of the text are they grouped in order of appearance, by date of publication or alphabetically? If alphabetically does “de Ville” come under “d” or “v”? To reformat a reference, we simply need to change “myfile.bst”.

Most latex distributions come with a set of bibliography styles. Some examples can be found here (a page which also explains all of the above much better than I have). However, it is very easy to produce a custom file using the custom-bib package. After a one-click download it is as simple as typing:

 latex makebst.ins latex makebst.tex 

Here’s a screenshot to prove it. At the bottom is the first of thirty or forty multiple-choice questions about how you want your references to look. If in doubt, just close your eyes and press return to select the default.

The problem with a multiple-choice fest is that if you make a poor decision at question 28 you have to go through the whole process again. Fortunately, this can be circumvented — as well as generating a pretty “myfile.bst” file, the custom-bib package generates an intermediate file “myfile.dbj”. Changing your multiple-choice answers is just a matter of commenting out the relevant parts and typing “latex myfile.dbj”. A snippet of a “dbj” file is below:

Selected options are those without a “%” sign on the left hand side. Who would have thought that Latex could be so cuddly?

# Conservation of kinks in membrane proteins

In my Part II project, I have been comparing pairs of homologous alpha-helical transmembrane proteins from the PDB, looking specifically at how often kinks are conserved. Kinks are of particular interest in membrane protein structure as they can introduce flexibility for movement between conformations and are therefore important for function.

Kinks typically involve some disruption of the usual pattern of hydrogen bonding in the helix, so proline is frequently found at the location of helix kinks due to its lack of an N-H group. However, there are many kinks where proline is not present, and mutation of a proline residue to alanine at the site of a kink does not necessarily lead to removal of the kink. This led to the suggestion that all kinks are originally caused by prolines, but consequent changes to the surrounding structure mean that the kink can be retained even when a mutation occurs at a later point. Therefore I have particularly been looking at the incidence of prolines around kinks in the pairs of homologous helices.

In order to find kinks in helices I have been using two different programs: Kink Finder, written by Henry, another member of the group, and MC-HELAN. Kink Finder fits cylinders to a helix to measure angles at each residue, while MC-HELAN finds sections of ideal helix and labels any overlapping sections as kinks. These contrasting methods both indicate the extent of the kink by giving an angle between the axes fitted to sections of helix on either side. TM-align and MP-T have both been used to align the pairs of proteins and find homologous helices, and I am also looking at the multiple sequence alignments provided by MP-T to question whether proline can always be found in a homolog at the position of the kink, even if it is not present in the protein itself.

This work will hopefully give a better indication of when kinks should be predicted, particularly when using a program such as MEDELLER to predict membrane protein structure by homology modelling. If it were known in which circumstances a kink could be confidently predicted as conserved, the accuracy of the prediction of membrane regions could be further improved.

# Journal Club: The complexity of Binding

Molecular recognition is the mechanism by which two or more molecules come together to form a specific complex. But how do molecules recognise and interact with each other?

In the TIBS Opinion article by Ruth Nussinov group, an extended conformational selection model is described. This model includes the classical lock-and-key, induced fit, conformational selection mechanisms and their combination.

The general concept of equilibrium shift of the ensemble was proposed nearly 15 years ago, or perharps earlier. The basic idea is that proteins in solution pre-exist in a number of conformational substates, including those with binding sites complementary to a ligand. The distribution of the substates can be visualised as free energy landscape (see figure above), which helps in understanding the dynamic nature of the conformational equilibrium.

This equilibrium is not static, it is sensitive to the environment and many other factors. An equilibrium shift can be achieved by (a) sequence modifications of special protein regions termed protein segments, (b) post-translational modifications of a protein, (c) ligand binding, etc.

So why are these concepts discussed and published again?

While the theory is straight-forward, proving conformational selection is hard and it is even harder to quantify it computationally. Experimental techniques such Nuclear Magnetic Resonance (NMR), single molecule studies (e.g. protein yoga), targeted mutagenesis and its effect on the energy landscape, plus molecular dynamics (MD) simulations have been helping to conceptualise conformational transitions. Meanwhile, there is still a long way to go before a full understanding of atomic scale pathways is achieved.

# How to make environment variables persist in your web app (Apache2)

Last Sunday, with the Memoir gang, we were talking about using this blog as a technical notebook. This post is in that spirit.

We have moved most of the validation of Memoir before submitting the job to the queue.  This has the advantage that the user knows immediately the submission has failed (instead of waiting for the job to be processed in the queue).  The only issue with this is that the apache user (on most systems www-data) is going to run the validation pipeline.  Memoir requires a ton of PATH variables addition.

In most cases you cannot add variables in .bashrc as the apache user does not have a home directory.  The easiest way how to add environment variables for the apache user is:

sudo vim /etc/apache2/envvars

And add the following line at the bottom (these have to be customized according to the server):

export PATH=\$PATH:/opt/tmalign:/opt/muscle:/opt/joy-5.10:/opt/joy-5.10/psa:/opt/joy_related-1.06:/opt/joy_related-1.06/sstruc:/opt/joy_related-1.06/hbond:/opt/medeller-dev/bin:/opt/usearch6.0.307:/opt/ncbi-blast-2.2.27+:/opt/MPT/bin

After this restart apache – and you should be laughing. Or, as is often the case, apache will be laughing at you.

sudo service apache2 restart

The apache user should now “see” the amended path variable.

Here it is this technical note – so in a few weeks time when I forget all of the above I can just copy and paste it …

(This post is short, also because there is the superbowl!!)