Author Archives: Anthony Bradley

OOMMPPAA: A tool to aid directed synthesis by the combined analysis of activity and structural data

Motivation

Recently I published a paper on OOMMPPAA, my 3D matched-molecular pair tool to aid drug discovery. The tool is available to try here, download here and read about here. OOMMPPAA aims to tackle the big-data problem in drug discovery – where X-ray structures and activity data have become increasingly abundant. Below I will explain what a 3D MMP is.

What are MMPs?

MMPs are two compounds that are identical apart from one structural change, as shown in the example below. Across a set of a thousand compounds, tens of thousands of MMPs can be found. Each pair can be represented as a transformation. Below would be Br >> OH. Within the dataset possibly hundreds of Br >> OH will exist.

mmp

An example of an MMP

 

 

 

 

 

 

Each of these transformations will also be associated with a change in a measurable compound property, (solubility, acidity, binding energy, number of bromine atoms…). Each general transformation (e.g. Br>>OH) therefore would have a distribution of values for a given property. From this distribution we can infer the likely effect of making this change on an unmeasured compound. From all the distributions (for all the transformations) we can then predict the most likely transformation to improve the property we are interested in.

The issue with MMPs

Until recently the MMP approach had been of limited use for predicting compound binding energies. This is for two core reasons.

1) Most distributions would be pretty well normally distributed around zero. Binding is complex and the same change can have both positive and negative effects.

2) Those distributions that are overwhelmingly positive, by definition, produce increased binding against many proteins. A key aim of drug discovery is to produce selective compounds that increase binding energy only for the protein you are interested in. So increasing binding energy like that is not overall very useful.

3D MMPs to the rescue

3D MMPs aim to resolve these issues and allow MMP analysis to be applied to binding energy predictions. 3D MMPs use structural data to place the transformations in the context of the protein. One method, VAMMPIRE, asks what is the overall effect of Br>>OH when it is near to a Leucine, Tyrosine and Tryptophan (for example). In this way selective changes can be found.

Another method by BMS aggregates these changes across a target class, in their cases kinases. From this they can ask questions like, “Is it overall beneficial to add a cyclo-proply amide in this region of the kinase binding site”.

What does my tool do differently?

OOMMPPAA differs from the above in two core ways. First, OOMMPPAA uses a pharmacophore abstraction to analyse changes between molecules. This is an effective way of increasing the number of observations for each transition. Secondly OOMMPPAA does not aggregate activity changes into general trends but considers positive and negative activity changes separately. We show in the paper that this allows for more nuanced analysis of confounding factors in the available data.

Freezing python code

Many of us in the group use python as our primary programming language. It is in my opinion an awesome language for lots of reasons. However what happens when you write an application and want to share it with the world? Simply distributing the source code requires a great deal of configuration by the end user. I’m sure you’ve all been there, you have version 1.5.1 they use version 1.6.3. However to download and install this breaks every other bit of code you are using. Creating virtual environments can help towards this, but then do you really want to go towards all the hassle of this for every application you want to use? In the end I have given up trying on a number of projects, which is a fate you would never want for your own code!

From my point of view there are three ways of counteracting this issue.

  1. Make limited use of libraries and imports
  2. Have incredibly clear instructions on how to set up the virtual env
  3. Freeze your code!

The first solution is sometimes just not possible or desirable. For example if you want to use a web framework or connect to third party database engines. The second could be massively time consuming and it is virtually impossible to cover all bases. For example, RDKit, my favourite cheminformatics package, has a lengthy install process with platform specific quirks and many of its own dependencies.

In my project I opted for solution number three. I use PyInstaller however there are many others available (cx_freeze, py2apppy2exe). I used PyInstaller because my application uses the Django project and they offer extra Django support. Also PyInstaller is cross-platform, allowing me (in theory) to package applications for Windows, Mac and Linux using the same protocol.

Here I will briefly outline how to set freeze your code using PyInstaller. This application validates a smiles string and shows you the RDKit canonical form of the smiles string.
This is the structure of the code:

src/
  main.py
  module/
    __init__.py
    functions.py
build/
dist/

main.py is:

import sys
from module.functions import my_fun
if len(sys.argv) > 1:
  smiles = sys.argv[1]
  print my_fun(smiles)
else:
  print "No smiles string requested for validation"

functions.py is:

from rdkit import Chem
def my_fun(smiles):
  mol = Chem.MolFromSmiles(smiles)
  if mol is None:
    return "Invalid smiles"
  else:
    return "Valid smiles IN:  " + smiles + "  OUT: " + Chem.MolToSmiles(mol,isomericSmiles=True) 

  1. Download and install PyInstaller 
  2. Type the following (assuming main.py is your python script)
  3. pyinstaller src\main.py --name frozen.exe --onefile
      This will produce a the following directory structure:

      src/
        main.py
        module/
          functions.py
      build/
        frozen/
      dist/
        frozen.exe
      frozen.spec
      

      frozen.spec is a file containing the options for building the application:

      a = Analysis(['src\\main.py'],
                   pathex=['P:\\PATH\\TO\\HEAD'],
                   hiddenimports=[],
                   hookspath=None,
                   runtime_hooks=None)
      pyz = PYZ(a.pure)
      exe = EXE(pyz,
                a.scripts,
                a.binaries,
                a.zipfiles,
                a.datas,
                name='frozen.exe',
                debug=False,
                strip=None,
                upx=True,
                console=True )
      

      “build” contains files used in the building of the executable

      “dist” contains the executable that you can distribute freely around. Because I used the “–onefile” option above it creates one single .exe file. This makes the file very easy to ship – HOWEVER for large programmes this isn’t totally ideal. All the dependencies are compressed into the .exe and uncompressed into a temporary folder at runtime. If there are lots of files, this process can be VERY slow.

      So now we can run the program:

      dist/frozen.exe c1ccccc1 

      Running dist/frozen.exe returns the error: ImportError: numpy.core.multiarray failed to import
      This is because the RDKit uses this module and it is not packaged up in the frozen code. The easiest way to resolve this is to include this import in main.py:

      from rdkit import Chem
      import numpy
      import sys
      from module.functions import my_fun
      if len(sys.argv) > 1:
        smiles = sys.argv[1]
        print my_fun(smiles)
      else:
        print "No smiles string requested for validation"
      

      And there you have it. “frozen.exe” can be passed around to anyone using windows (in this case) and will work on their box.

      Obviously this is a very simple application. However I have used this to package Django applications, using Tornado web servers and with multiple complex dependencies to produce native windows desktop applications. It works! Any questions, post below!

Activity cliffs

Mini-perspective on activity cliffs as a medicinal chemistry tool

Recently in group meeting we discussed activity cliffs and their application to medicinal chemistry. The talk was based largely on the excellent mini-perspective on the subject written by Stumpfe et al.

What is an activity cliff?

Activity cliffs are two compounds that represent a small structural change but a large activity change. They are used commonly in the design of new compounds targeting a particular protein (drugs). They work on the principal that if a given structural change has previously had a large affect on activity it is likely to have a similar affect on a different compound series. In this way they can be used as predictive tools to suggest chemical transformations that are likely to improve activity for a given compound.

To define an activity cliff, one must consider what a small structural change and a large activity change mean.

Small structural change

Structural changes can be measured using a seemingly endless array of methods. A lot of methods will condense the chemical information of the molecule into a bit-vector. Each bit indicates the molecule contains a particular type of chemical functionality, e.g. a methyl group. Molecular similarity is then assessed by comparing the bit-vectors, most commonly by finding the Tanimoto similarity between the them. This then returns a single value between 0 and 1 indicating how similar the two molecules are (the greater the more similar). To define small structural change, one must decide upon a threshold value above which two molecules are sufficiently similar.
An alternative method is to find matched molecular pairs – compounds which are identical apart from one structural change. An example of one is shown below. For matched molecular pairs the only parameter required is the size of the non-matching part of the pair. This is usually measured in non-hydrogen atoms. The threshold to use for this parameter is chosen equally arbitrarily however it has a much more intuitive effect.

mmp

An example of a matched molecular pair

Which method to use?

Similarity methods are less rigid and are capable of finding molecules that are very similar, however that differ in two or more subtle ways. They however are also liable to find molecules similar when they would not be perceived as so. In this work Stumpfe et al. show that different similarity methods do not agree greatly on which molecules are “similar”. They compare six different fingerprint methods used to find similar molecules. Each method finds around 30% similar molecules in the datasets used, however the consensus between the methods is only 15%. This indicates that there is no clear definition of “similar” using bit-string similarity. Interestingly a third of the molecules found to be similar by all six fingerprint methods are not considered matched molecular pairs. This demonstrates a downside of the matched molecular pair approach, that it is liable to miss highly similar molecules that differ in a couple of small ways.

Matched molecular pairs are, however, least liable to find false-positives, i.e. compounds that are seen as similar but in fact are not actually similar. The transformations they represent are easily understood and this can be easily applied to novel compounds. For these reasons matched molecular pairs were chosen by Stumpfe et al. for this work to indicate small structural changes.

Large activity change

A large activity change is an equally arbitrary decision to make. The exact value that indicates an activity cliff will depend on the assay used and the protein being tested against. Stumpfe et al. reasonably suggest that approximate measures should not be used and that activity scores found between different assays should not be compared.

Rationales for activity cliffs

If structural data is available for an activity cliff, rationales for their corresponding activity change can be suggested. These can then be used to suggest other alterations that might have a similar impact. Stumpfe et al. consider the five most common rationales for activity cliffs.

  • H-bond and or ionic interactions: these interactions will increase the binding energy forming specific interactions with the protein
  • Lipophilic and aromatic groups: these groups can form specific protein-ligand interactions, e.g. pi-pi stacking and also form favourable interactions with hydrophobic residues in the protein
  • Water molecules: One molecule in the pair displaces water molecules from the active site, altering the binding energy
  • Stereochemistry changes: for example altering an enantiomeric form of a compound alters the projection of a group, forming or losing favourable/disfavourable protein-ligand interactions
  • Multiple effects: a combination of the above, and thus difficult to establish the dominant feature.

Are they generally useful?

Stumpfe et al. consider whether activity cliffs are more useful for some proteins or protein classes than others. They investigate how many compounds form activity cliffs for many protein targets for which activity data is openly available. For proteins with more than 200 compounds with activity data the number of activity cliff forming compounds is roughly equivalent (around 10%). This is an interesting and unexpected result. The proteins used in this study have different binding sites attracting different opportunities for protein-ligand interactions. It would not, therefore naturally be expected that these would attract similar opportunities for generating activity cliffs. This result shows that the activity cliff concept is generically useful, irrespective of the protein being targeted.

Are they predictive?

Although activity cliffs make intuitive sense, Stumpfe et al. consider whether it has been quantitatively successful in previous drug discovery efforts. They investigate all of the times that activity cliff information was available from openly available data. They then find all the times this information was used in a different compound series and if it was used whether it had a positive or negative effect on activity.

Interestingly available activity cliff information had not been used in 75% of cases. They suggest that this indicates this information is an as yet underused resource. Secondly, in the cases where it was used, 60% of the time it was successful in improving activity and 40% of the time is was unsuccessful. They suggest this indicates the activity cliff method is useful for suggesting novel additions to compounds. Indeed it is true that a method that gives a 60% success rate in predicting more potent compounds would be considered useful by most if not all medicinal chemists. It would be interesting to investigate if there were patterns in protein environment or the nature of the structural changes in the cases where the activity cliff method is not successful.

Have they been successful?

Finally Stumpfe et al. investigate whether using activity cliff information gives a higher probability of synthesising a compound in the 10% most active against the target protein. They show that in 54% of cases using activity cliff information a compound in the 10% most active is formed. Conversely when this information is not used only 28% of pathways produce a compound in the 10% most active. They argue this indicates that using activity cliff information improves the chances of producing active compounds.

Conclusion

The paper discussed here offers an excellent overview of the activity cliff concept and its application. They demonstrate, in this work and others, that activity cliffs are generally useful, predictive and currently underused. The method can therefore be used in new tools to improve the efficiency of drug discovery.

Django for scientific applications

In my current work I am developing a cheminformatics tool using structural and activity data to investigate protein-ligand binding. I have only ever properly used love python and I listen to Saulo, so I decided to used Django to develop my application. I didn’t understand what it was and why it might be useful before I started using it but below I thought I’d discuss a few of the features that I think have been useful and might encourage others to use it.

Firstly I will outline how Django works. I wanted to download all the PDB structures for CDK2 and store the information in a data structure that is robust and easily used. We have a Target and a Protein. A Target is associated to a particular UniProt accession. Cyclin-dependent kinase 2 (CDK2) is a Target. A Protein is a set of 3D coordinates, so 1AQ1 is a Protein.

class Target(models.Model):
"""A Django model to define a given protein target"""
    UniProt = models.CharField(max_length=20,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    Title = models.CharField(max_length=10)

In the above Target model I have three different fields. The first field denotes the UniProt accession for the Target and is “unique”. This means that only one Target can have any given UniProt accession in my data structure. If I try to add another with the same value in the UniProt field it will throw an exception. The second field denotes the time and date that the model was created. This means I can check back to when the target was created. The third is the Title I would like to use for this, for example CDK2.

I can then make a new Target objects by:

new_target = Target()
new_target.Title = "CDK2"
new_target.UniProt = "P24941"

and save it to the database by:

new_target.save() # Django takes care of the required SQL

The next model is for the Protein molecules:

class Protein(models.Model):
    """A Django model to define a given protein"""
    Code = models.CharField(max_length=6,unique=True)
    InitDate = models.DateTimeField(auto_now_add=True)
    TargetID = models.ForeignKey(Target)
    Apo = models.BoolenField()
    PDBInfo = models.FileField(upload_to='pdb')

The model contains the PDB Code, e.g. 1AQ1, and the date it was added to the database. It also consists of a foreign key, relating it to its Target and a boolean indicating if the structure is apo or holo. Finally there is a file field relating this entry to the appropriate file path where the PDB information is stored.

Once the data has been added to the database, Django then deals with all SQL queries from the database:

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
CDK2_prots = Protein.objects.filter(TargetID__Title="CDK2") # All PDB entries associated to CDK2, as a query set, behaving similarily to a list
CDK2_list = [x for x in CDK2_prots] # Now exactly like a list

The “__” in the above query allows one to span the foreign key relationship, so it is searching for the Title of the Target not the Title of the Protein. Finally I can then access the PDB files for each of these proteins.

my_prot = Protein.objects.get(Code="1aq1") # Gives me the Protein object "1aq1"
print my_prot.Code # prints "1aq1"
# my_prot.PDBInfo has the behaviour of a file handle
pdb_lines = my_prot.PDBInfo.readlines()# Reads the lines of the file

There, you’ve made a queryable database, where Django deals with all the hard stuff and everything is native to python. Obviously in this example it might not be so difficult to imagine alternative ways of creating the same thing using directory structures, but as the structure of your data becomes more complex, Django can be easily manipulated and as it grow it utilises the speed advantages of modern databases.

Viewing ligands in twilight electron density

In this week’s journal club we discussed an excellent review paper by E. Pozharski, C. X. Weichenberger and B. Rupp investigating crystallographic approaches to protein-ligand complex elucidation. The paper assessed and highlighted the shortcomings of deposited PDB structures containing ligand-protein complexes. It then made suggestions for the community as a whole and for researchers making use of ligand-protein complexes in their work.

The paper discussed:

  • The difficulties in protein ligand complex elucidation
  • The tools to assess the quality of protein-ligand structures both qualitative and quantitative
  • The methods used describing their analysis of certain PDB structures
  • Some case studies visually demonstrating these issues
  • Some practical conclusions for the crystallographic community
  • Some practical conclusions for non-crystallographer users of protein-ligand complex structures from the PDB

The basic difficulties of ligand-protein complex elucidation

  • Ligands have less than 100% occupancy – sometimes significantly less and thus will inherently show up less clearly in the overall electron density.
  • Ligands make small contributions to the overall structure and thus global quality measures , such as r-factors, will be affected only minutely by the ligand portion of the structure being wrong
  • The original basis model needs to be used appropriately. The r-free data from the original APO model should be used to avoid model bias

The following are the tools available to inspect the quality of agreement between protein structures and their associated data.

  • Visual inspection of the Fo-Fc and 2Fo-Fc maps,using software such as COOT, is essential to assess qualitatively whether a structure is justified by the evidence.
  • Use of local measures of quality for example real space correlation coefficients (RSCC)
  • Their own tool, making use of the above as well as global quality measure resolution

Methods and results

In a separate publication they had analysed the entirety of the PDB containing both ligands and published structure factors. In this sample they demonstrate 7.6% had RSCC values of less than 0.6 the arbitrary cut off they use to determine whether the experimental evidence supports the model coordinates.

Figure to show an incorrectly oriented ligand (a) and its correction (b)

An incorrectly oriented ligand (a) and its correction (b). In all of these figures Blue is the 2mFoDFc map contoured at 1σ and Green and Red are positive and negative conturing of the mFoDFc map at 3σ

In this publication they visually inspected a subset of structures to assess in more detail how effective that arbitrary cutoff is and ascertain the reason for poor correlation. They showed the following:

(i) Ligands incorrectly identified as questionable,false positives(7.4%)
(ii) Incorrectly modelled ligands (5.2%)
(iii) Ligands with partially missing density (29.2%).
(iv) Glycosylation sites (31.3%)
(v) Ligands placed into electron density that is likely to
originate from mother-liquor components
(vi) Incorrect ligand (4.7%)
(vii) Ligands that are entirely unjustified by the electron
density (11.9%).

The first point on the above data is that the false-positive rate using RSCC of 0.6 is 7.4%. This demonstrates that this value is not sufficient to accurately determine incorrect ligand coordinates. Within the other categories all errors can be attributed to one of or a combination of the following two factors:

  • The inexperience of the crystallographer being unable to understand the data in front of them
  • The wilful denial of the data in front of the crystallographer in order that they present the data they wanted to see
Figure to show a ligand placed in density for a sulphate ion from the mother liquor (a) and it's correction (b)

A ligand incorrectly placed in density for a sulphate ion from the mother liquor (a) and it’s correction (b)

The paper observed that a disproportionate amount of poor answers was derived from glycosylation sites. In some instances these observations were used to inform the biochemistry of the protein in question. Interestingly this follows observations from almost a decade ago, however many of the examples in the Twilight paper were taken from 2008 or later. This indicates the community as a whole is not reacting to this problem and needs further prodding.

Figure to show an incomplete glycosylation site inaccurately modeled

Figure to show an incomplete glycosylation site inaccurately modeled

Conclusions and suggestions

For inexperienced users looking at ligand-protein complexes from the PDB:

  • Inspect the electron density map using COOT if is available to determine qualitatively is their evidence for the ligand being there
  • If using large numbers of ligand-protein complexes, use a script such as Twilight to find the RSCC value for the ligand to give some confidence a ligand is actually present as stated

For the crystallographic community:

  • Improved training of crystallographers to ensure errors due to genuine misinterpretation of the underlying data are minimised
  • More submission of electron-density maps, even if not publically available they should form part of initial structure validation
  • Software is easy to use but difficult to analyse the output