Better Models Through Molecular Standardization

“Cheminformatics is hard.”

— Paul Finn

I would add: “Chemistry is nuanced”… Just as there are many different ways of drawing the same molecule, SMILES is flexible enough to allow us to write the same molecule in different ways. While canonical SMILES can resolve this problem, we sometimes have different problem. In some situations, e.g., in machine learning, we need to map all these variants back to the same molecule. We also need to make sure we clean up our input molecules and eliminate invalid or incomplete structures.

Different Versions of the Same Molecule: Salt, Neutral or Charged?

Sometimes, a chemical supplier or compound vendor provides a salt of the compound, e.g., sodium acetate, but all we care about is the organic anion, i.e., the acetate. Very often, our models are built on the assumption we have only one molecule as input—but a salt will appear as two molecules (the sodium ion and the acetate ion). We might also have been given just the negatively-charged acetate instead of the neutral acetic acid.

Tautomers

Another important chemical phenomenon exists where apparently different molecules with identical heavy atoms and a nearby hydrogen can be easily interconverted: tautomers. By moving just one hydrogen atom and exchanging adjacent bond orders, the molecule can convert from one form to another. Usually, one tautomeric form is most stable. Warfarin, a blood-thinning drug, can exist in solution in 40 distinct tautomeric forms. A famous example is keto-enol tautomerism: for example, ethenol (not ethanol) can interconvert with the ketone form. When one form is more stable than the other form(s), we need to make sure we convert the less stable form(s) into the most stable form. Ethenol, a.k.a. vinyl alcohol, (SMILES: ‘C=CO[H]’), will be more stable when it is in the ketone form (SMILES: ‘CC(=O)([H])’):

from IPython.display import SVG # to use Scalar Vector Graphics (SVG) not bitmaps, for cleaner lines

import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw # to draw molecules
from rdkit.Chem.Draw import IPythonConsole # to draw inline in iPython
from rdkit.Chem import rdDepictor  # to generate 2D depictions of molecules
from rdkit.Chem.Draw import rdMolDraw2D # to draw 2D molecules using vectors

AllChem.ReactionFromSmarts('[C:1]-[C:2](-[O:3]-[H:4])>>[C:1]-[C:2](=[O:3])(-[H:4])')

The following code derives from Greg Landrum and JP Ebejer, with two variants of the method: one that expects a SMILES string, and another that needs an RDKit molecule.

The provenance of the molecular standardization code began with Matt Swain’s Python code, MolVS, which my former DPhil student, Dr Susan Leung, translated to C++ in RDKit while working with Greg Landrum on a Google Summer of Code project.

If you call standardize_mol() with verbose=True, you can see the steps involved in standardizing the molecule and how the input molecule changes.

def standardize_mol(mol, verbose=False):
    """Standardize the RDKit molecule, select its parent molecule, uncharge it, 
    then enumerate all the tautomers.
    If verbose is true, an explanation of the steps and structures of the molecule
    as it is standardized will be output."""
    # Follows the steps from:
    #  https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb
    # as described **excellently** (by Greg Landrum) in
    # https://www.youtube.com/watch?v=eWTApNX8dJQ -- thanks JP!
    
    from rdkit.Chem.MolStandardize import rdMolStandardize
    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol) 
    if verbose:
        print('Remove Hs, disconnect metal atoms, normalize the molecule, reionize the molecule:')
        draw_mol_with_SVG(clean_mol)

    # if many fragments, get the "parent" (the actual mol we are interested in) 
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
    if verbose:
        print('Select the "parent" fragment:')
        draw_mol_with_SVG(parent_clean_mol)

    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger() # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
    if verbose:
        print('Neutralize the molecule:')
        draw_mol_with_SVG(uncharged_parent_clean_mol)

    # Note: no attempt is made at reionization at this step
    # nor ionization at some pH (RDKit has no pKa caculator);
    # the main aim to to represent all molecules from different sources
    # in a (single) standard way, for use in ML, catalogues, etc.
    te = rdMolStandardize.TautomerEnumerator() # idem
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
    if verbose:
        print('Enumerate tautomers:')
        draw_mol_with_SVG(taut_uncharged_parent_clean_mol)
    assert taut_uncharged_parent_clean_mol != None
    
    if verbose: print(Chem.MolToSmiles(taut_uncharged_parent_clean_mol))
    return taut_uncharged_parent_clean_mol


def standardize_smiles(smiles, verbose=False):
  """Standardize the SMILES string, select its parent molecule, uncharge it, 
    then enumerate all the tautomers."""
  if verbose: print(smiles)
  std_mol = standardize_mol(Chem.MolFromSmiles(smiles), verbose=verbose)
  return Chem.MolToSmiles(std_mol)


def draw_mol_with_SVG(mol, molSize=(450,150)):
    """Use SVG to draw an RDKit molecule, mol."""
    mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():        
        rdDepictor.Compute2DCoords(mc) # Compute 2D coordinates
    # Initialize the drawer with the size
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
    drawer.DrawMolecule(mc) # Draw the molcule
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText() # Get the SVG string
    display(SVG(svg.replace('svg:',''))) # Fix the SVG string and display

Let’s try standardizing acetic acid. We will use three forms: acetic acid, acetate, and sodium acetate. First, let’s build the molecules and draw them:

acetic_acid_smiles = ['CC(=O)O',     'CC(=O)[O-]', '[Na+].CC(=O)[O-]']
acetic_acid_names  = ['acetic acid', 'acetate',    'sodium acetate']
acetic_acid_mols   = [Chem.MolFromSmiles(s) for s in acetic_acid_smiles]

Draw.MolsToGridImage(acetic_acid_mols, molsPerRow=3, subImgSize=(150,150), legends=['  '+n+'  ' for n in acetic_acid_names], useSVG=True)

Now let’s standardize each version: acetic acid, acetate, and sodium acetate:

standardize_smiles(acetic_acid_smiles[0], verbose=True)

What happens if we standardize acetate?

standardize_smiles(acetic_acid_smiles[1], verbose=True)

What if we standardize sodium acetate?

standardize_smiles(acetic_acid_smiles[2], verbose=True)

As we can see, our code has transformed all three forms into the same molecule, neutral acetic acid.

We could also write a variant of our standardize_mol code that re-ionizes the molecule, by calling rdMolStandardize.Reionize(mol), instead of creating an uncharger and calling uncharger.uncharge(parent_clean_mol).

Let’s look at tautomerism: let’s start with the enol-form of our molecule, prop-1-en-2-ol (C3H6O):

smi = 'C=C(O)C' # prop-1-en-2-ol
std_smi = standardize_smiles(smi, verbose=True)

Let’s see what happens if we start with the keto-form, propan-2-one; note that it has exactly the same chemical formula, (C3H6O):

smi ='CC(=O)C' # propan-2-one
std_smi = standardize_smiles(smi, verbose=True)

We can see that the molecular standardization process has transformed both the enol and the keto forms into the same endpoint: the keto-form.

We have only scratched the surface of this topic; you can read more in Richard Apodaca’s Depth-First post. If we care about creating accurate machine learning models from molecules and molecular graphs, we need to take care to prepare the highest quality input data, and molecular standardization is one aspect of molecular data science that is too often overlooked.

Happy standardizing!

Author