The workings of Fragmenstein’s RDKit neighbour-aware minimisation

Fragmenstein is a Python module that combine hits or position a derivative following given templates by being very strict in obeying them. This is done by creating a “monster”, a compound that has the atomic positions of the templates, which then reanimated by very strict energy minimisation. This is done in two steps, first in RDKit with an extracted frozen neighbourhood and then in PyRosetta within a flexible protein. The mapping for both combinations and placements are complicated, but I will focus here on a particular step the minimisation, primarily in answer to an enquiry, namely how does the RDKit minimisation work.

Normal run

Choosing as a test the data from Gahbauer et al. 2023 (where Fragmenstein is used to find an awesome hit), I will do a normal merge of the hits from PDB:5RSW and PDB:5RUE, while placing PDB:5SQJ against these. Parenthetically, I do not use the word “dock” as it does not test MCMC rototranslations of pre-generating conformers and instead creates a conformer and fixes it with not conformer resampling —the PyRosetta part uses a custom FastRelax script with no rotamers and with extra constraints.
In the footnote below the data is prepared.
In the footnote a dataclass instance with the three structures was created with values accession, clean_pdbblock and mol. Now let’s combine them.

from fragmenstein import Victor, Igor

apo_pdbblock = '\n'.join([l for l in template1.clean_pdbblock.split('\n') if 'ATOM' in l])

Igor.init_pyrosetta()
vicky = Victor([template1.mol, template2.mol], pdb_block=apo_pdbblock)
vicky.combine(long_name='merger')
print(vicky.summarize())

The latter gave a potential of -14.1 kcal/mol with 23 constrained atoms and none unconstrained and an RMSD of 0.23Å and run in 17s. Sounds great, so let’s have a gander. Parenthetically, Fragmenstein has nglview display options, but unfortunately this no longer works in Colab and with newer IPython widget module.

view = py3Dmol.view(width=800, height=400)
view.addModel(apo_pdbblock, 'pdb')
view.setStyle({'chain': 'A'}, {'cartoon': {'color': 'turquoise'}})
for hit in vicky.hits:
    view.addModel(Chem.MolToMolBlock(hit), 'mol')
    view.setStyle({'model': -1}, {'stick': {'colorscheme': 'whiteCarbon'}})
view.addModel(Chem.MolToMolBlock(vicky.minimized_mol), 'mol')
view.setStyle({'model': -1}, {'stick': {'colorscheme': 'yellowCarbon'}})
view.zoomTo({'model': -1})
view.show()

Now let’s place. Parenthetically, I should say that I use Fragmenstein like this when testing hypotheses, for large scale mergers I submit HPC jobs running the fragmenstein pipeline.

vicky = Victor([template1.mol, template2.mol], pdb_block=apo_pdbblock)
vicky.place(derivative.SMILES, long_name='placed')

The placement worked well (1Å RMSD), although I ought to mention that Laboratory (the pipeline, expand_isomers of .place) makes Victor try all stereoisomers, as a result the resulting stereoisomer will likely be a product of the monster step. About the classes, the class Monster handles the making of the stitched together compound, Igor the PyRosetta minimisation (Fritz the OpenMM minimisation), Victor orchestrates the two, while Laboratory does multiple mergers or/and placements. There is also Walter which sets up the scene with synthetic tests —although it is .

But what if 5RSW did not exist? What happens? Let’s throw Fragmenstein under the bus as it has no idea how to best place the other part.

This is because in order to compromise with the atomic positions especially between mismatching templates it does not do a easy constrained embedding via RDKit AllChem.EmbedMolecule or even AllChem.ConstrainedEmbed, as this would fail. Parenthetically, a water could be provided as a hit along with a simple map override. Sometimes, the list in Victor(...).monster.mol_options might have a satisfactory solution.
However, here we want to dissect the workings of Monster. Truth be told, the above is actually a third run of the code as the first run gave 0.9Å relative to the actual conformer of the derivative compound (below) and the second 1.2Å, both of which are good. The fact that each run gives a slightly different outcome for the unconstrained part is rather meaningless without knowing the crystallographic conformer: one needs to balance risk when shortlisting compounds and, if that second hit or the charged native ligand did not exist, few scientists would be that profligate to take that gamble.

Neighbourhood

Monster can do a RDKit based minimisation using MMFF in a frozen neighbourhood.
This takes a fraction of a second, but is imperfect so a second minimisation in the protein is done. It is imperfect as it in vacuo, the neighbourhood is fixed and the score is internal energy. The latter is independent of entropy whereas in PyRosetta the hybrid score function is better at accounting for this as its not pure physics-based and instead has fitted values (fudge factors in purist lingo) even if it is in implicit solvent (Lazaridis-Karplus solvation potential) and the torsions are not explicitly there —for a quick overview of thermodynamics see this previous post).

On the operations, the first step is getting the neighbourhood.
Victor, which handles the apo PDB block, calls monster.get_neighborhood(apo, cutoff, mol) with mol being optional as otherwise monster.positioned_mol is used (code). The PDB block is read into RDKit (it lacked altLoc see footnote) and then the method gets every atom in the protein Chem.Mol within a atom-to-atom cutoff distance and then expands any aromatic atoms to the full aromatic ring.

# Extract the neighbourhood!
neigh: Chem.Mol = vicky.monster.get_neighborhood(vicky.apo_pdbblock, 8.)
# `Monster.get_neighborhood(None, vicky.apo_pdbblock, 8., mol)` would have done the same

# view it
view = py3Dmol.view(width=800, height=400)
view.addModel(Chem.MolToMolBlock(neigh), 'mol')
view.setStyle({'model': -1}, {'stick': {'colorscheme': 'cyanCarbon'}})
view.addModel(Chem.MolToMolBlock(vicky.monster.positioned_mol), 'mol')
view.setStyle({'model': -1}, {'stick': {'colorscheme': 'magentaCarbon'}})
view.zoomTo({'model': -1})
view.show()

Minimise

The next step in minimisation where the neighbourhood atoms are, in molecular mechanics parlance, “frozen” (i.e. unable to move). Parenthetically, crystallographers use the word “constrained” for this and “restrained” for a constrained atom (say with a harmonic tether): personally I feel it is best avoid using that terminology.
The minimisation is done by monster.mmff_minimize, which substitutes the dummy atoms for a carbon (which would crash otherwise), then the following happens (with some extra steps such as reruns):

# Given
atom_indices_to_freeze: List[int]
atom_indices_to_tether: List[int]
mol: Chem.Mol
ff_max_displacement: float
ff_constraint: float

# Run
p: AllChem.ForceField.MMFFMolProperties = AllChem.MMFFGetMoleculeProperties(mol, 'MMFF94')
ff: AllChem.ForceField.ForceField = AllChem.MMFFGetMoleculeForceField(mol, p)
ff.Initialize()
for i in atom_indices_to_freeze:
      ff.AddFixedPoint(i)
for i in atom_indices_to_tether:
      ff.MMFFAddPositionConstraint(i, maxDispl=ff_max_displacement, forceConstant=ff_constraint)
ff.Minimize()
ff.CalcEnergy()

Looking at the C-code for the function ff.MMFFAddPositionConstraint, one sees that this is a harmonic force: force_constant times the square of the difference between distance and max_displacement_constant (zero if distance is less than max_displacement_constant).
The whole operating acts in place on the Chem.Mol.

The last step is extracting the compound from the combined mol–neighbourhood system with is done via monster.extract_from_neighborhood. The latter actually removes the neighbourhood before calling Chem.GetMolFrags because when things go super-duper badly wrong there may be new bonds.

Example

So lets use an externally generated conformers and find the best using this code.
First, let’s fix the template and make a bunch of embedded conformers of the derivative constrained against it (below target). Fragmenstein is fine with imperfect matches, but for this walkthrough we will use Chem.Mol.GetSubstructMatch.

from fragmenstein import Monster

# no hydroxy and carboxy > amide in target
monstah = Monster([template1.mol])
monstah.place_smiles('Nc1ccc(C(=O)N)cc1')
mod_template: Chem.Mol = monstah.positioned_mol

Next let’s generate conformers but with certain atoms fixed. Note that AllChem.EmbedMultipleConfs will drift so needs realigning.

target = Chem.MolFromSmiles( derivative.SMILES )
mod_conf: Chem.Conformer = mod_template.GetConformer()
coords = {ti: mod_conf.GetAtomPosition(qi) for qi, ti in enumerate(target.GetSubstructMatch( mod_template ))}
AllChem.EmbedMultipleConfs(target, coordMap=coords, numConfs=10)
atom_map = [(ti, qi) for qi, ti in enumerate(fauxstah.positioned_mol.GetSubstructMatch(mod_template))]
# fix drift
for ci in range(target.GetNumConformers()):
    assert rdMolAlign.AlignMol(target, mod_template, prbCid=ci, atomMap=atom_map) < .5 # By definition this will never happen, but do be aware of rounding errors

So let’s minimise!

from typing import List
from rdkit.Chem import rdMolAlign
from fragmenstein import MinizationOutcome  # just a dataclass for typehinting

# make a faux monster
fauxstah = Monster.__new__(Monster)
fauxstah.positioned_mol = Chem.Mol(target)
conf: Chem.Conformer
solutions: List[MinizationOutcome] = []
for conf in target.GetConformers():
    fauxstah.positioned_mol.RemoveAllConformers()
    fauxstah.positioned_mol.AddConformer( conf )
    neighborhood: Chem.Mol = fauxstah.get_neighborhood(apo_pdbblock, cutoff=8.)
    out = fauxstah.mmff_minimize(fauxstah.positioned_mol,
                            neighborhood=neighborhood,
                            ff_max_displacement=0.1,
                            ff_constraint=5.0,
                            allow_lax=True)
    solutions.append(out)

solutions = sorted(solutions, key=lambda mo: mo.U_post)
# let's cheat! RMSD against the actual crystallographic one.
for opt in [opt.mol for opt in solutions]:
    print(rdMolAlign.CalcRMS(derivative.mol, opt))

The third entry of the internal energy sorted conformers is the closest (1.0Å). Obviously, this is cheating as that info would not be available.
Let’s have a gander of the most similar:

The solutions from the conformer hack

Footnote: data generation

Fragmenstein in the demo submodule already has some examples, but here I want to operate from first principles.

First define the accession codes, lets use a dataclass instance to hold the data:

from dataclasses import dataclass
from rdkit import Chem

@dataclass
class Template:
    accession: str
    chemcomp: str
    SMILES: str
    raw_pdbblock: str = None
    clean_pdbblock: str = None
    mol: Chem.Mol = None

derivative = Template(accession='5SQJ', chemcomp='QRU', SMILES='c1cc2c(c(c1)O)C[C@H]([C@H]2NC(=O)c3ccc(cc3)NC(=O)NC4CC4)C(=O)[O-]')
template1 = Template(accession='5RUE', chemcomp='BHA', SMILES='c1cc(c(cc1N)O)C(=O)[O-]')
template2 = Template(accession='5RSW', chemcomp='6FZ', SMILES='c1ccc2c(c1)CC(C2)C(=O)[O-]')

Second, let’s retrieve the PDB blocks and remove pesky altloc (RDKit hates them):

import requests
from typing import Callable

def remove_altloc(pdbblock: str) -> str:
    """Removes all altlocs and actually segi duplicates.
    Test line:
    
    ATOM      1  N   MET A   1       0.000  0.000  0.000  1.00 60.69           N
    """
    lines: List[str] = []
    seen: List[Tuple[str, str, int]] = []
    for line in pdbblock.split('\n'):
        if 'ANISOU' in line:
            continue # skip
        if line[:4] != 'ATOM' and line[:6] != 'HETATM':
            lines.append(line)
            continue
        atom_info = line[12:16].strip(), line[21].strip(), int(line[22:26].strip())
        if atom_info not in seen:
            lines.append(f'{line[:16]} {line[17:]}')
            seen.append(atom_info)
        else: # skip
            pass
    return '\n'.join(lines)

def fetch(pdb_acc: str) -> str:
    response: requests.Response = requests.get(f'https://files.rcsb.org/download/{pdb_acc}.pdb')
    response.raise_for_status()
    return response.text

model: Template
for model in (derivative, template1, template2):
    model.raw_pdbblock=fetch(model.accession)
    model.clean_pdbblock = remove_altloc( model.raw_pdbblock)

Let’s extract the hits

from fragmenstein import Monster, Victor

for model in (template1, template2, derivative):
    model.mol = Victor.extract_mol(model.accession, block=model.clean_pdbblock,
                                   ligand_resn=model.chemcomp,
                                   smiles=model.SMILES,
                                   removeHs=True,
                                   proximityBonding=False,
                                   throw_on_error=True)
    model.mol.SetProp(model.chemcomp)

And lastly let’s have a gander.

import py3Dmol

def get_view(pdbblock: str, resn: str, chain='A', width=800, height=400) -> py3Dmol.view:
    view = py3Dmol.view(width=width, height=height)
    view.addModel(pdbblock, 'pdb')
    view.setStyle({'chain': chain}, {'cartoon': {'color': 'turquoise'}})
    view.setStyle({'chain': {'$ne': chain}}, {})
    view.setStyle({'within':{'distance':'5', 'sel':{'resn':resn}}}, {'stick': {}})
    view.setStyle({'resn':resn}, {'stick': {'colorscheme': 'yellowCarbon'}})
    view.zoomTo({'resn':resn})
    return view

model = derivative
view: py3Dmol.view = get_view(model.clean_pdbblock, model.chemcomp)
view.show()

Author