Journal Club: “Discriminative Chemical Patterns: Automatic and Interactive Design”

For Journal Club this week I decided to discuss the following paper by M. Rarey et al., which describes a method of using SMARTS patterns to discriminate between two sets of molecules. Link to paper here.

Given two sets of molecules can one generate a pattern that discriminates between two sets? This relates to a key question in drug design: can we predict whether molecules bind or not given a set of binders and a set of non-binders. The method is of particular interest because it makes use of data available, unlike conventional methods. However, for this technique to work, the correct molecular classification is required to discriminate between the two sets of molecules.

Originally molecules were classified using physiochemical properties for example, molecular weight or log P. However these classifications are too general and do not encompass enough molecular detail for accurate discrimination. An alternative is to using topological fingerprints. These encode a set the presence of a set of topological features using a series of bits. One of the limitations with this classification is that it is restricted by the predefined set of structures and features. This method makes use of chemical patterns which advantageously can can classify a chemical feature that cannot be sufficiently described by molecular substructure.

SMARTS (a molecular description language based on SMILES) allow description of structures with varying levels of specificity. For example one can specify atomic element, whether the atom is a subset of elements, whether it is aliphatic or aromatic, or whether it is in a ring. The method makes use of this description of molecules as the group have already developed some software to visualise SMARTS strings and modify them: the SMARTSeditor.

The method involves combining automatic pattern generation and visualisation to form SMARTSminer. Given two distinct molecule sets, the algorithm derives connected chemical patterns to differentiate both sets by using a sub-graph mining technique: solutions are extended by single elements iteratively.

The SMARTSminer was then used to test a series of test cases using the DUD (Database of Useful Decoys) data set. This seems strange when the data set has been shown to be inaccurate and perhaps there are more accurate test sets available, such as DUDe (Database of Useful Decoys enchanced). Let us look a couple of these case sets in more detail.

  1. Discrimination between Active Molecules on Similar Targets

The first case set looks at discriminating between molecules that are active for COX-1 and COX-2. COX proteins are cyclooxygenase that are involved in inflammatory reaction. These proteins are targeted by inhibitors such as aspirin and ibuprofen for the relief of inflammatin and pain. Both COX-1 and COX-2 are similar targets with similar molecular weight and 65% sequence identity. Selective inhibition is only due to a difference in residue at position 523.

Separation of the sets of molecules was possible with a pattern identified that hit 21/25 of the molecules active for COX-1 and 15/348 of molecules of molecules active for COX-2. When the positive and negative set are reversed a pattern is identified that matched 313/348 of COX-2 actives but only 1 of the COX-1 ligands. The group state that perfect separation is not possible as there is an overlap of 2 molecules.

It is interesting that patterns were identified that could discriminate between the two sets. However, there is no discussion of how to use this information. Additionally the pattern determined has not been tested on any molecules outside of the training set – there are no blind tests. This seems strange as a blind test could emphasise the usefulness of this method if it was successful.

2. Discrimination between Active and Inactive Molecules

The second case investigates determining whether a pattern can be generated that discriminates between active and inactive targets. The test case used target SAHH (S-adenosyl-homocysteine hydrolase). A pattern was generated that matched all active molecules and only 1% of inactives. What is particularly exciting is that the pattern found contains part of the interaction network hydrogen bonding partners of the ligand, as shown in the figure below (the pattern identified is highlighted in green).


I find it very surprising that the group did not follow up with blind tests of molecules not used in the training set – especially as the pattern identified a key part of the binding mechanism.

To summarise a new method, SMARTSminer, calculates discriminative patterns between two sets of molecules using the SMARTS language. The authors state that the method has shown applicability in several use cases covering the application of actives vs decoys, kinase classifications, analysis of data sets and characterisation of reaction centers. However, I’m not sure I can agree with that statement. I believe further blind tests would be required to prove the applicability of the method once the pattern has been found. I also believe that an analysis of whether the pattern is over fitted to the training data is also required.