Extracting 3D Pharmacophore Points with RDKit | Oxford Protein Informatics Group

Pharmacophores are simplified representations of the key interactions ligands make with proteins, such as hydrogen bonds, charge interactions, and aromatic contacts. Think of them as the essential “bumps and grooves” on a key that allow it to fit its lock (the protein). These maps can be derived from ligands or protein–ligand complexes and are powerful tools for virtual screening and generative models. Here, we’ll see how to extract 3D pharmacophore points from a ligand using RDKit.
(Code adapted from Dr. Ruben Sanchez.)

Why pharmacophore “points”?

RDKit represents each pharmacophore feature (donor, acceptor, aromatic, etc.) as a point in 3D space, located at the feature center. These points capture the essential interaction motifs of a ligand without requiring the full atomic detail.

Core idea: RDKit’s FeatureFactory

So, how does this work? RDKit uses a definition file (BaseFeatures.fdef) that contains SMARTS patterns to identify common pharmacophore features like donors, acceptors, and aromatic rings.
The ChemicalFeatures.BuildFeatureFactory class reads this file and creates a feature factory. This factory can then scan a molecule with a 3D conformation and identify all matching features.
- Family: The type of feature (e.g., “Donor”, “Acceptor”).
- Atom indices: The specific atoms in the molecule that define the feature.
- 3D position: A single point in 3D space representing the feature’s geometric center.

The function

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
from collections import defaultdict
from typing import Iterable, Dict, List, Tuple

# Define which pharmacophore families we want to extract

PHARMACOPHORE_FAMILES_TO_KEEP = ('Donor', 'Acceptor','Aromatic')

_FEATURES_FACTORY = []

def get_features_factory(features_names, resetPharmacophoreFactory=False):
    global _FEATURES_FACTORY, _FEATURES_NAMES
    if resetPharmacophoreFactory or (len(_FEATURES_FACTORY) > 0 and _FEATURES_FACTORY[-1] != features_names):
        _FEATURES_FACTORY.pop()
        _FEATURES_FACTORY.pop()
    if len(_FEATURES_FACTORY) == 0:
        feature_factory = AllChem.BuildFeatureFactory(os.path.join(RDConfig.RDDataDir, 'BaseFeatures.fdef'))
        _FEATURES_NAMES = features_names
        if features_names is None:
            features_names = list(feature_factory.GetFeatureFamilies())

        _FEATURES_FACTORY.extend([feature_factory, features_names])
    return _FEATURES_FACTORY


def getPharmacophoreCoords(mol, features_names: Iterable[str] = PHARMACOPHORE_FAMILES_TO_KEEP, confId:int=-1) -> \
        Tuple[Dict[str, np.ndarray],  Dict[str, List[np.ndarray]]] :

    """
    Extracts 3D pharmacophore points from a molecule.

    Args:
        mol: An RDKit molecule with an embedded 3D conformer.
        families: A tuple of pharmacophore family names to extract.
        confId: The conformer ID to use.

    Returns:
        A dictionary containing the coordinates ('coords'), family types ('families'),
        and atom indices ('indices') for each pharmacophore point.
    """

    feature_factory, keep_featnames = get_features_factory(features_names)
    rawFeats = feature_factory.GetFeaturesForMol(mol, confId=confId)
    featsDict = defaultdict(list)
    idxsDict = defaultdict(list)
    allCoords = defaultdict(list)

    for f in rawFeats:
        if f.GetFamily() in keep_featnames:
            featsDict[f.GetFamily()].append(np.array(f.GetPos(confId=f.GetActiveConformer())))
            idxsDict[f.GetFamily()].append(np.array(f.GetAtomIds()))
            allCoords['Coords'].append(np.array(f.GetPos(confId=f.GetActiveConformer())))
            allCoords['Family'].append(f.GetFamily())

    new_feats_dict = {}
    for key in featsDict:
        new_feats_dict[key] = np.concatenate(featsDict[key]).reshape((-1,3))
    return new_feats_dict, idxsDict, allCoords

What it returns

new_feats_dict : Maps each feature family to a NumPy array of its 3D coordinates.
idxsDict : Maps each feature family to a list of the atom indices that define each feature.
allCoords Contains two flat lists, one for all feature coordinates and one for their corresponding families.

End-to-end example

Now, let’s use our function in a complete workflow.
1. First, we need a molecule with a 3D structure.
2. Next, we call our function and extract Pharmacophore Points
3. Let’s see what we found.

from rdkit import Chem, RDConfig
from rdkit.Chem import AllChem
import numpy as np
import os.path

# 1) Build a 3D molecule
smiles = "c1cc(ccc1O)C(=O)NC"  # example ligand
mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
AllChem.UFFOptimizeMolecule(mol)

# 2) Extract pharmacophore points
new_feats_dict, idxsDict, allCoords = getPharmacophoreCoords(
    mol, features_names=PHARMACOPHORE_FAMILES_TO_KEEP, confId=-1
)

# 3) Use the structured outputs
print("Families present:", list(new_feats_dict.keys()))

for fam, coords in new_feats_dict.items():
    print(f"- {fam}: {coords.shape[0]} point(s)")

Handling Overlapping Atoms

A single atom can be part of multiple features. If you need a list of unique atoms involved in any pharmacophore, you can get them from idxsDict

# Global (all families) atom index summary
all_ids = [int(i) for feats in idxsDict.values() for arr in feats for i in arr]
print("Total indices:", len(all_ids))

# Keep only unique atom indices (across all features)
pharm_indices = sorted(set(all_ids))
print(f"Unique atoms involved in features: {len(pharm_indices)}")

Saving a Subset for Visualization

What if you want to visualize only the atoms that contribute to specific features, like Donors and Acceptors? You can create a new, minimal molecule containing only those atoms and save it as an SDF file.

# 1) Pick families to save
FAMS_TO_SAVE = ("Donor", "Acceptor")

# 2) Collect unique atom indices from those families only
ids_save = []
for fam in FAMS_TO_SAVE:
    if fam in idxsDict:
        for arr in idxsDict[fam]:
            ids_save.extend(int(i) for i in arr)
ids_save = sorted(set(ids_save))

# 3) Build a minimal atom-subset molecule (no bonds for simplicity)
def build_subset_mol(mol, atom_indices):
    rw = Chem.RWMol()
    conf = mol.GetConformer()
    coords = conf.GetPositions()

    # add atoms with original Z and charge
    for old in atom_indices:
        a_old = mol.GetAtomWithIdx(old)
        a_new = Chem.Atom(a_old.GetAtomicNum())
        a_new.SetFormalCharge(a_old.GetFormalCharge())
        a_new.SetNoImplicit(True)
        rw.AddAtom(a_new)

    # add conformer with the original coordinates
    conf_new = Chem.Conformer(len(atom_indices))
    for i, old in enumerate(atom_indices):
        x, y, z = coords[old]
        conf_new.SetAtomPosition(i, (float(x), float(y), float(z)))
    rw.AddConformer(conf_new, assignId=True)

    return rw.GetMol()

subset = build_subset_mol(mol, ids_save)

# 4) Save to SDF
writer = Chem.SDWriter("donor_acceptor_atoms.sdf")
writer.write(subset)
writer.close()

print(f"Saved {len(ids_save)} atoms to donor_acceptor_atoms.sdf")

You can now open this SDF file in a molecular viewer to see exactly which atoms contribute to the key hydrogen bonding features!

Author

Yael Ziv

View all posts