Mapping derivative compounds to parent hits | Oxford Protein Informatics Group

Whereas it is easy to say in a paper “Given the HT-Sequential-ITC results, 42 led to 113, a substituted decahydro-2,6-methanocyclopropa[f]indene”, it is frequently rather trickier algorithmically figure out which atoms map to which. In Fragmenstein, for the placement route, for example, a lot goes on behind the scenes, yet for some cases human provided mapping may be required. Here I discuss how to get the mapping from Fragmenstein and what goes on behind the scenes.

Here be dragons! (Not of any kind, but kaijū of the terrible sequel variety)

When there is a single parent hit and it is a single substructure of the derivative, the mapping is easy. However, when it is not, then a maximum common substructure or edit distance approach is required. The former will have different results depending on the strictness, e.g. ignoring isotopes, mapping ring atoms to non-ring atoms, ignoring bond order or ignoring even elements. This is okay for substituents, but not for scaffold hops or atom insertions. Then to make matters more complicated, when two or more parent hits are in play, the position of the respective atoms becomes critical. And even then not for all as an unimportant substituent may need to be ignored. In some cases, not all parent hits actually contributed and are enumerated for other derivatives (cf. MPro Moonshot data).

Enumerating the various examples is beyond the scope of this post —on request I might expand. Instead, I simply hope to impart it is not a trivial process and any tool that says otherwise is extremely simplistic.

Fragmenstein placement

In the placement route of Fragmenstein in the default approach (‘expansion’) atoms of a provided derivative compound are mapped to the atoms of the parent hits allowing the coordinates of these to be stitched together. In this tutorial I go through how one can get this information what goes on behind the scenes.

Let’s start with a simple placement. The chemistry operations irrespective of the protein are done by the class Monster, so let’s start there. It’s dinner time, so let’s go with a foodie compound, capsaicin, which is a vanillamine derivative (a coumaric acid derivative).

# ## Faux Hit
from typing import Callable

get_smiles_from_chembl: Callable[[str,], str] = ...  # from footnote
capsaicin = Chem.MolFromSmiles(get_smiles_from_chembl('capsaicin'))
capsaicin.SetProp('_Name', 'capsaicin')
AllChem.EmbedMolecule(capsaicin)

# ## Place
vanillin_smiles = get_smiles_from_chembl('vanillin')
monstah = Monster(hits=[capsaicin])
monstah.place_smiles(vanillin_smiles)
monstah.positioned_mol

In Fragmenstein, the parent hit atoms are referred to as “origins” (this is non-standard for reasons that I cannot recall) and one can view the mapping in a Jupyter notebook via the method Monster.show_comparison.

Like other visuals in Fragmenstein (eg. to_3Dmol, to_nglview or make_pse), the colours are actually in the Fragmenstein colour branding. The colours are assigned based on fragmenstein.branding.divergent_colors (Dict[int, List[str]]), which are divergent colours in HChL colourspace starting from the primary colour, feijoa, a tint of Chartreuse (an infamous hue in OPIG due to Shrek Green). One can get mapping to hex codes via Monster.get_color_origins, but that is pointless. We want the actual mapping, which can be done via Monster.convert_origins_to_custom_map (which can be passed as an exhaustive custom map —note that the custom map for the place method of some Fragmenstein classes need not be exhaustive as generally a well-placed -1 (ignore corresponding index in mapping) will do the trick.

{'capsaicin': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: -1, 8: -2, 9: -3, 10: -4, 11: -5, 12: -6, 13: -7, 14: -8, 15: -9, 16: -10, 17: -11, 18: 7, 19: 8, 20: 9, 21: 10}}

The method name is rather cryptic simply because the “origin” is actually not stored as a custom map. For example the method Monster.origin_from_mol gives this gibberish:

[['capsaicin.0'], ['capsaicin.1'], ['capsaicin.2'], ['capsaicin.3'], ['capsaicin.4'], ['capsaicin.5'], ['capsaicin.6'], ['capsaicin.18'], ['capsaicin.19'], ['capsaicin.20'], ['capsaicin.21']]

Which in turn is from a private property (_Origin) of each atom in the Chem.Mol.
This is all relying on the attribute .positioned_mol. If one were to move away from the position one could use the method Monster.get_mcs_mapping(parent, derivative) on different molecules.

Workings

The way the multi-component mapping is done is by finding overlapping atoms in the parent hits, and then iteratively mapping with laxer and laxer settings across all parent hits.

This is not the perfect solution, which covers all cases (cf. Here be dragons section), but covers a lot of them, primarily cases with small differences, but not scaffold hops.

What happens is that the derivative is mapped with a strict setting to each hit (unless the primary is manually specified) and the map which covers the most atoms is chosen as the starting point. This map is converted to cover the overlapping atoms in other hits (more below), and more lax searches are performed using the previous map as a guide (more below) until no more atoms can be placed.

The atomic overlap follows certain rules, as described in the paper . In Monster, the class method Monster.get_positional_mapping finds overlapping atoms between two compounds according to these rules. Some atomic overlaps may be ‘red herrings’. In the NUDT7 example in the paper, an oxygen atom between the two hits was such an example.

Now for the “fun” part. Custom MCS mapping in RDKit.

The MCS mapper as implemented in RDKit works roughly in three steps, first every atom in the first compound is compared to every atom in the other wherein a function returns true or false based on certain criteria, such as element, isotope etc. The same is done for bonds. Then the network with the most connected nodes is chosen. In RDKit, with recent-ish changes, one can fine tune this.
In the case of Fragmenstein a map is provided forcing the comparison. In RDKit documentation there is a nice guide on how to fine tune the mapping.
Fragmenstein does something similar (code), but has to account for more so there’s a few brick walls to jump. Here is your bog standard MCS:

from rdkit.Chem import rdFMCS

params = rdFMCS.MCSParameters()
params.AtomTyper = rdFMCS.AtomCompare.CompareElements  # or a custom class based off `rdFMCS.MCSAtomCompare`
params.BondTyper = rdFMCS.BondCompare.CompareOrder  # or a custom class based off `rdFMCS.MCSBondCompare`
res: rdFMCS.MCSResult = rdFMCS.FindMCS([molA, molB], params)

Parenthetically, until last year MCS mapping had different parameters, hence why Fragmenstein has a transmute_FindMCS_parameters method (code) to convert arguments to a rdFMCS.MCSParameters.

The first is that whereas one can subclass rdFMCS.MCSAtomCompare one cannot use the super command with __call__ given that its base class does not have a ctypes.py_object call object (its a C++ method), so the basic functionality of rdFMCS.MCSAtomCompare.__call__ was re-implemented (code), allowing the map based comparison to work (code). A benefit of this is that one can add extra functionality to reduce overhead, in this case the method get_valid_matches to get the mapping from the MCS comparator. One caveat is that the order of the atoms and molecules in the comparisons may not always be as expected.

Footnote: Format

I will admit I do not know what is the ideal format for atom mapping.

The custom map is a dictionary of key = hit title (_Name property in RDKit Chem.Mol) and value a dictionary of hit index to derivative index. This relies on some assumptions:

there is only one derivative, but multiple parents
the parents have titles (although there are several checks in place to force this)
the atom index is not changed

The latter could have been done via the atom name, however, this would require PDB entries or mol/sdf with the A tag (atomic property molFileAlias in RDkit), which is seldomly used. As a result occasionally, a custom map may need renumbering hence the method renumber_followup_custom_map.

Footnote: Auxiliary code

This code is not important, but is used to make the example.

import requests

def get_smiles_from_chembl(compound_name, entry_number=0):
    """
    Fetch the SMILES string for a given compound name from ChEMBL.
    """
    url = f'https://www.ebi.ac.uk/chembl/api/data/molecule/search?q={compound_name}&format=json'
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    if data['molecules']:
        smiles = data['molecules'][entry_number]['molecule_structures']['canonical_smiles']
        return smiles
    else:
        raise ValueError(f"Compound {compound_name} not found in ChEMBL database")

Author

Matteo Ferla

View all posts