How to turn a SMILES string into an extended-connectivity fingerprint using RDKit

After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors, I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).

ECFPs were originally described in a 2010 article by Rogers and Hahn [1] and are still among the most popular and efficient methods for turning a molecule into an informative vectorial representation for downstream machine learning tasks. The ECFP algorithm depends on two predefined hyperparameters: the fingerprint length L and the maximum radius R. An ECFP of length L takes the form of an L-dimensional bit vector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size: a substructure of radius r contains all atoms that can be reached from the center atom via at most r bonds. The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.

Circular subgraphs that are structurally isomorphic are further distinguished according to their inherited atom and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types: single, double, triple, or aromatic. To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [2], but other versions of ECFPs that use pharmacophoric atom features also exist [1]. Optionally, the algorithm can also distinguish atoms by their tetrahedral chirality.
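To make the idea of growing circular substructures a little more concrete, here is a minimal pure-Python sketch of the iterative identifier update at the heart of the algorithm. It is a simplification and not RDKit's implementation: the helper names `stable_hash` and `toy_ecfp_identifiers` are hypothetical, the atom invariants are reduced to three hand-written values per atom, the hash function is an arbitrary choice, and the removal of structurally duplicate substructures described in [1] is omitted.

```python
import hashlib


def stable_hash(obj):
    """Deterministic 32-bit hash of an object's repr (illustrative choice only)."""
    digest = hashlib.sha1(repr(obj).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def toy_ecfp_identifiers(atom_invariants, bonds, max_radius):
    """
    atom_invariants ... list with one tuple of atom features per atom
    bonds ... dict mapping atom index pairs (i, j) to a bond order
    max_radius ... maximum radius R of the circular substructures

    Returns the set of identifiers collected over all atoms and all radii 0..R.
    """
    n_atoms = len(atom_invariants)
    neighbours = {i: [] for i in range(n_atoms)}
    for (i, j), order in bonds.items():
        neighbours[i].append((order, j))
        neighbours[j].append((order, i))

    # radius 0: every atom is described by its own invariants only
    current = [stable_hash(inv) for inv in atom_invariants]
    collected = set(current)

    # radius r: combine an atom's identifier with the (bond order, identifier)
    # pairs of its neighbours from the previous iteration, in sorted order
    for radius in range(1, max_radius + 1):
        updated = []
        for i in range(n_atoms):
            env = sorted((order, current[j]) for order, j in neighbours[i])
            updated.append(stable_hash((radius, current[i], tuple(env))))
        current = updated
        collected.update(current)
    return collected


# toy ethanol (C-C-O); invariants are (atomic number, heavy-atom degree, hydrogens)
ethanol_atoms = [(6, 1, 3), (6, 2, 2), (8, 1, 1)]
ethanol_bonds = {(0, 1): 1.0, (1, 2): 1.0}

ids_r0 = toy_ecfp_identifiers(ethanol_atoms, ethanol_bonds, max_radius=0)
ids_r2 = toy_ecfp_identifiers(ethanol_atoms, ethanol_bonds, max_radius=2)

# same atoms, but with a double bond between atoms 1 and 2
ids_double = toy_ecfp_identifiers(ethanol_atoms, {(0, 1): 1.0, (1, 2): 2.0}, max_radius=2)
```

In this sketch the radius-0 identifiers are always contained in the radius-2 set, and changing a single bond order changes the identifiers of every substructure that contains that bond, which mirrors the distinction by bond type described above.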

Using RDKit, a SMILES string can be transformed into an ECFP in a straightforward manner via the following function:

# import packages
import numpy as np
from rdkit.Chem import AllChem


# define function that transforms SMILES strings into ECFPs
def ECFP_from_smiles(smiles,
                     R = 2,
                     L = 2**10,
                     use_features = False,
                     use_chirality = False):
    """
    Inputs:
    
    - smiles ... SMILES string of input compound
    - R ... maximum radius of circular substructures
    - L ... fingerprint-length
    - use_features ... if false then use standard DAYLIGHT atom features, if true then use pharmacophoric atom features
    - use_chirality ... if true then append tetrahedral chirality flags to atom features
    
    Outputs:
    - np.array(feature_list) ... ECFP with length L and maximum radius R
    """
    
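    molecule = AllChem.MolFromSmiles(smiles)
    # MolFromSmiles returns None if the SMILES string cannot be parsed
    if molecule is None:
        raise ValueError(f"Could not parse SMILES string: {smiles!r}")
    feature_list = AllChem.GetMorganFingerprintAsBitVect(molecule,
                                                         radius = R,
                                                         nBits = L,
                                                         useFeatures = use_features,
                                                         useChirality = use_chirality)
    return np.array(feature_list)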
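As a quick sanity check, the sketch below computes a fingerprint for ethanol (the SMILES string "CCO" is just an example I picked) and uses the optional bitInfo argument of GetMorganFingerprintAsBitVect to record which center atom and radius caused each bit to be set. It calls RDKit directly so that it can be run on its own.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

# example input, chosen only for illustration
smiles = "CCO"
mol = Chem.MolFromSmiles(smiles)

# bit_info is filled by RDKit: bit index -> tuple of (center atom index, radius) pairs
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024, bitInfo=bit_info)
ecfp = np.array(fp)

print("fingerprint shape:", ecfp.shape)
print("number of set bits:", int(ecfp.sum()))
for bit, environments in sorted(bit_info.items()):
    for atom_index, radius in environments:
        print(f"bit {bit}: center atom {atom_index}, radius {radius}")
```

Depending on the installed RDKit version, this older function may print a deprecation notice; newer releases also offer the rdFingerprintGenerator module as an alternative way to compute Morgan fingerprints.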

If the length L of the ECFP is chosen to be very large, then each of its components corresponds to one particular, unambiguous circular subgraph with its atom and bond features. The associated bit is set to 1 if and only if this circular substructure is present anywhere in the molecule; otherwise it is set to 0. However, if L becomes small, then hash collisions start to occur that reduce the resolution of the ECFP: a fingerprint component can become ambiguous and correspond to any one of several distinct circular substructures. Therefore, L should be chosen large enough to preserve the expressivity of the ECFP. Common choices are L = 1024 or L = 2048.
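The effect of a small L can be illustrated with a toy folding step. The sketch below assumes that integer substructure identifiers are mapped to bit positions by taking them modulo L; this is a common way to fold hashed fingerprints, but RDKit's internals may differ in detail. The function `fold_identifiers` and the identifier values are made up for illustration.

```python
def fold_identifiers(identifiers, L):
    """Fold integer substructure identifiers into an L-bit vector via modulo."""
    bits = [0] * L
    bit_to_ids = {}
    for ident in identifiers:
        position = ident % L
        bits[position] = 1
        bit_to_ids.setdefault(position, set()).add(ident)
    # a collision is a bit position shared by more than one identifier
    collisions = {pos: ids for pos, ids in bit_to_ids.items() if len(ids) > 1}
    return bits, collisions


# five made-up identifiers standing in for five distinct circular substructures
identifiers = {5, 13, 21, 1030, 2048}

bits_small, collisions_small = fold_identifiers(identifiers, 8)
bits_large, collisions_large = fold_identifiers(identifiers, 1024)

print(sum(bits_small), collisions_small)  # 3 bits set; 5, 13 and 21 share bit 5
print(sum(bits_large), collisions_large)  # 5 bits set; no collisions
```

With L = 8, three different substructures end up on the same bit, so that bit can no longer tell them apart; with L = 1024, each of the five substructures keeps its own bit.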

In the literature, ECFP featurisations with maximum radius R are often written as "ECFP" followed by the number 2R, where 2R is interpreted as the maximum fingerprint diameter. For example, the frequently used 1024-bit ECFP4 featurisation describes an ECFP with maximum radius R = 2 and length L = 1024. ECFPs are simple yet powerful molecular featurisations, and I like to look at them as a non-differentiable type of message-passing graph neural network (GNN). Have fun using them for molecular machine learning!

[1] Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.

[2] Weininger, David, Arthur Weininger, and Joseph L. Weininger. “SMILES. 2. Algorithm for generation of unique SMILES notation.” Journal of Chemical Information and Computer Sciences 29.2 (1989): 97-101.
