Customising MCS mapping in RDKit

Finding the parts in common between two molecules appears to be a straightforward, but actually is a maze of layers. The task, maximum common substructure (MCS) searching, in RDKit is done by Chem.rdFMCS.FindMCS, which is highly customisable with lots of presets. What if one wanted to control in minute detail if a given atom X and is a match for atom Y? There is a way and this is how.

Aim

First, there are several resources on the basics of MCS, such as Greg Landrum’s tutorial on the subject. Herein, I want to discuss a specific use case: full control over the atom matching itself.
There are two options: calculate all combinations and filter them, or create a custom atom matcher. The first option will grind to a halt very quickly, so the second option is often the only viable one.
For Fragmenstein in order to map a molecule onto overlapping 3D molecules I had to resort to the latter deep in the code.

Basics

As the basics are well covered elsewhere, I am going to wizz through them.

To find the MCS between two molecules, the following code can be used.
The function `name2mol` (get a mol from a name via NIH Cactus) and `flatgrid` (a `Draw.MolsToGrid` wrapper) are defined in the footnote.

from rdkit import Chem
from rdkit.Chem import rdFMCS

capsaicin = name2mol('capsaicin')
jasminaldehyde = name2mol('jasminaldehyde')
result: rdFMCS.MCSResult = rdFMCS.FindMCS([capsaicin, jasminaldehyde])
flatgrid([capsaicin, jasminaldehyde, Chem.MolFromSmarts(result.smartsString)])

This is not what one would expect, because one needs to tweak the parameters of the search to match that expectation.
The function rdFMCS.FindMCS went throw several evolutions and is overloaded. The first older/messier way is that it accepts several arguments, which I would say fall into two groups:

  • The boolean arguments matchValences, ringMatchesRingOnly, completeRingsOnly, matchChiralTag
  • The compare enum arguments atomCompare, bondCompare, ringCompare

The two are mostly mutually exclusive but don’t have a univocal mapping (ringMatchesRingOnly, completeRingsOnly are both covered by ringCompare). The compare enums are more powerful.
For example, here are the values they can take:

  • atomCompare argument accepts the rdFMCS.AtomCompare enum, which has the values
    CompareAny, CompareElements, CompareIsotopes, CompareAnyHeavyAtom.
  • bondCompare wants rdFMCS.BondCompare
    CompareAny, CompareOrder, CompareOrderExact
  • ringCompare wants rdFMCS.RingCompare
    IgnoreRingFusion, PermissiveRingFusion, StrictRingFusion

The newer one, accepts a single parameter object (rdFMCS.MCSParameters). This objects two attributes with boolean-holding namedtuple-like objects (AtomCompareParameters and BondCompareParameters), which are a reshuffle of the boolean arguments; and two “typer” attributes (AtomTyper and BondTyper), which can be more than just the enums and allows some nice stuff.

For Fragmenstein, I was in need of transpiling the former system into the latter as arguments may be passed in the former system but the latter was used. As a result I had a function to do so:

def transmute_FindMCS_parameters(
        **mode: Unpack[ExtendedFMCSMode]) -> rdFMCS.MCSParameters:  # noqa lowercase not applicable
    """
    The function ``rdFMCS.FindMCS`` has two ways of being used.
    In one, a series of arguments are passed,
    in another a ``rdFMCS.MCSParameters`` object is passed (a wrapped C++ structure).
    Unfortunately, there does not seem to be a way to transmute the former into the other.

    Hence, this function

    The ``params.AtomTyper`` and ``params.BondTyper`` members
    can be either

    * the enum ``rdFMCS.AtomCompare`` or ``rdFMCS.BondCompare``
    * or a subclass of ``rdFMCS.MCSBondCompare`` or ``rdFMCS.MCSAtomCompare``

    The ``rdFMCS.RingCompare`` appears to be absorbed into the ``params.BondCompareParameters``
    member.
    """
    params = rdFMCS.MCSParameters()
    # three integers: https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/FMCS/FMCS.h
    # they are not rdFMCS.AtomCompare rdFMCS.BondCompare and rdFMCS.RingCompare enums?
    # atom parameters
    atomCompare: rdFMCS.AtomCompare = mode.get('atomCompare', rdFMCS.AtomCompare.CompareElements)
    params.AtomTyper = atomCompare  # int or a callable
    params.AtomCompareParameters.MatchIsotope = atomCompare == rdFMCS.AtomCompare.CompareIsotopes
    params.AtomCompareParameters.CompleteRingsOnly = mode.get('completeRingsOnly', False)
    params.AtomCompareParameters.MatchChiralTag = mode.get('matchChiralTag', False)
    params.AtomCompareParameters.MatchValences = mode.get('matchValences', False)
    params.AtomCompareParameters.RingMatchesRingOnly = mode.get('ringMatchesRingOnly', False)
    # bond parameters
    bondCompare: rdFMCS.BondCompare = mode.get('bondCompare', rdFMCS.BondCompare.CompareOrder)
    ringCompare: rdFMCS.RingCompare = mode.get('ringCompare', rdFMCS.RingCompare.IgnoreRingFusion)
    params.BondTyper = bondCompare
    params.BondCompareParameters.CompleteRingsOnly = mode.get('completeRingsOnly', False)
    params.BondCompareParameters.MatchFusedRings = ringCompare != rdFMCS.RingCompare.IgnoreRingFusion
    params.BondCompareParameters.MatchFusedRingsStrict = ringCompare == rdFMCS.RingCompare.StrictRingFusion
    params.BondCompareParameters.RingMatchesRingOnly = mode.get('ringMatchesRingOnly', False)
    params.Threshold = mode.get('threshold', 1.0)
    params.MaximizeBonds = mode.get('maximizeBonds', True)
    params.Timeout = mode.get('timeout', 3600)
    params.Verbose = mode.get('verbose', False)
    params.InitialSeed = mode.get('seedSmarts', '')
    # parameters with no equivalence (i.e. made up)
    params.BondCompareParameters.MatchStereo = mode.get('matchStereo', False)
    params.AtomCompareParameters.MatchFormalCharge = mode.get('matchFormalCharge', False)
    params.AtomCompareParameters.MaxDistance = mode.get('maxDistance', -1)
    # params.ProgressCallback Deprecated
    return params

Comparators

This meandering around legacy code-burden is to get to the new comparator objects. The atomTyper attribute accepts the aforementioned enums, but also can accept a callable… such as a subclass of rdFMCS.MCSAtomCompare or rdFMCS.MCSBondCompare. These are callable class instances that when called with the parameter obejct, the two molecules and the two atoms or bonds that are to be compare, they return a boolean verdict. The largest number of contiguous truths wins!

Obviously, there’s a catch. RDKit is PyBoost11–wrapped C++ code —as is evident by its flagrant disregard for PEP8 guidelines turned dogmas. In the case of subclassing a wrapped C++-class one cannot do super calls to unwrapped methods of the base class as the Python signatures will not match the C++ ones.
As a result the method __call__ needs to be written without parent calls:

class VanillaCompareAtoms(rdFMCS.MCSAtomCompare):
    """
    Give than the following does not work, one has to do it in full:

    .. code-block:: python
        :caption: This will raise an error:
        super().__call__(parameters, mol1, atom_idx1, mol2, atom_idx2)

    This class replicates the vanilla functionality
    """

    def __init__(self, comparison: rdFMCS.AtomCompare = rdFMCS.AtomCompare.CompareAnyHeavyAtom):
        """
        Whereas the atomCompare is an enum, this is a callable class.
        But in parameters there is no compareElement booleans etc. only Isotope...
        In https://github.com/rdkit/rdkit/blob/master/Code/GraphMol/FMCS/Wrap/testFMCS.py
        it is clear one needs to make one's own.
        """
        super().__init__()  # noqa the p_obect (python object aka ctypes.py_object) is not used
        self.comparison = comparison

    def __call__(self,  # noqa signature matches... it is just Boost being Boost
                 parameters: rdFMCS.MCSAtomCompareParameters,
                 mol1: Chem.Mol,
                 atom_idx1: int,
                 mol2: Chem.Mol,
                 atom_idx2: int) -> bool:
        a1: Chem.Atom = mol1.GetAtomWithIdx(atom_idx1)
        a2: Chem.Atom = mol2.GetAtomWithIdx(atom_idx2)
        # ------- isotope ------------------------
        if parameters.MatchIsotope and a1.GetIsotope() != a2.GetIsotope():  # noqa
            return False
        elif self.comparison == rdFMCS.AtomCompare.CompareIsotopes and a1.GetIsotope() != a2.GetIsotope():  # noqa
            return False
        elif self.comparison == rdFMCS.AtomCompare.CompareElements and a1.GetAtomicNum() != a2.GetAtomicNum():  # noqa
            return False
        elif self.comparison == rdFMCS.AtomCompare.CompareAnyHeavyAtom \
                and (a1.GetAtomicNum() == 1 or a2.GetAtomicNum() == 1):  # noqa
            return False
        elif self.comparison == rdFMCS.AtomCompare.CompareAny:
            pass
        # ------- valence ------------------------
        if parameters.MatchValences and a1.GetTotalValence() != a2.GetTotalValence():  # noqa
            return False
        # ------- chiral ------------------------
        if parameters.MatchChiralTag and not self.CheckAtomChirality(parameters, mol1, atom_idx1, mol2, atom_idx2):
            return False
        # ------- formal ------------------------
        if parameters.MatchFormalCharge and not self.CheckAtomCharge(parameters, mol1, atom_idx1, mol2, atom_idx2):
            return False
        # ------- ring ------------------------
        if parameters.RingMatchesRingOnly and not self.CheckAtomRingMatch(parameters, mol1, atom_idx1, mol2, atom_idx2):
            return False
        # ------- complete ------------------------
        if parameters.CompleteRingsOnly:
            # don't know its intended to be used
            pass
        # ------- distance ------------------------
        if parameters.MaxDistance:  # integer...
            # todo fill!
            pass
        return True

Now, we can subclass it and change the call as we please.

In my case, I wrote a class that accepts a mapping that has to be obeyed. For that I still however, need to add a method to filter out the matching substructures, but that is an easy feat as there is already a method that says when two atoms match, i.e. the __call__ method of the comparator!

Edit

In recent versions, the comparisons are called by rdFMCS.FindMCS not only between molA to molB atoms, but also molB to molB atoms, presumably to address isomorphisms.

Footnote

The following snippet to gets me a molecule from a name using Cactus:

import requests
from urllib.parse import quote
from rdkit import Chem
from rdkit.Chem import AllChem

def name2mol(name) -> Chem.Mol:
    """
    Using Cactus get a molecule from a name.
    """
    response: requests.Response = requests.get(
        f'https://cactus.nci.nih.gov/chemical/structure/{quote(name)}/smiles')
    response.raise_for_status()
    smiles: str = response.text
    mol: Chem.Mol = Chem.MolFromSmiles(smiles)
    mol.SetProp('_Name', name)
    AllChem.EmbedMolecule(mol)
    return mol

I always want to see the molecules on a grid but in 2D, so:

from rdkit.Chem import Draw
from typing import List
from IPython.display import Image

def flatgrid(mols, *args, **kwargs) -> Image:
    copies: List[Chem.Mol] = [Chem.Mol(m) for m in mols]
    *map(AllChem.Compute2DCoords, copies),   # noqa, it's in place
    if 'legends' not in kwargs:
        kwargs['legends'] = [m.GetProp('_Name') if m.HasProp('_Name') else '-' for m in mols]
    return Draw.MolsToGridImage(copies, *args, **kwargs)

Obviously, in the main text I have a figure, so the code was in reality:

from IPython.display import Image

img: Image = flatgrid([capsaicin, jasminaldehyde, Chem.MolFromSmarts(result.smartsString)])
with open('molecules.png', 'wb') as fh:
    fh.write(image.data)

Author