{"id":8946,"date":"2022-11-29T15:37:44","date_gmt":"2022-11-29T15:37:44","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=8946"},"modified":"2022-11-29T16:05:43","modified_gmt":"2022-11-29T16:05:43","slug":"how-to-turn-a-smiles-string-into-an-extended-connectivity-fingerprint-using-rdkit","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2022\/11\/how-to-turn-a-smiles-string-into-an-extended-connectivity-fingerprint-using-rdkit\/","title":{"rendered":"How to turn a SMILES string into an extended-connectivity fingerprint using RDKit"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">After my posts on <a href=\"https:\/\/www.blopig.com\/blog\/2022\/02\/how-to-turn-a-smiles-string-into-a-molecular-graph-for-pytorch-geometric\/\">how to turn a SMILES string into a molecular graph<\/a> and <a href=\"https:\/\/www.blopig.com\/blog\/2022\/06\/how-to-turn-a-molecule-into-a-vector-of-physicochemical-descriptors-using-rdkit\/\">how to turn a SMILES string into a vector of molecular descriptors<\/a>, I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">ECFPs were originally described in a 2010 article by Rogers and Hahn [1] and remain among the most popular and efficient methods for turning a molecule into an informative vector representation for downstream machine learning tasks. The ECFP algorithm depends on two predefined hyperparameters: the fingerprint length L and the maximum radius R. An ECFP of length L is an L-dimensional bit vector containing only 0s and 1s. Each component of an ECFP indicates the presence or absence of a particular circular substructure in the input compound. Each circular substructure has a center atom and a radius that determines its size. 
The hyperparameter R defines the maximum radius of any circular substructure whose presence or absence is indicated in the ECFP. Circular substructures for a central nitrogen atom in an example compound are depicted in the image below.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?resize=625%2C394&#038;ssl=1\" alt=\"\" class=\"wp-image-8950\" width=\"625\" height=\"394\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?resize=1024%2C646&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?resize=300%2C189&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?resize=768%2C485&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?resize=1536%2C969&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?resize=2048%2C1292&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?resize=624%2C394&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2022\/11\/circular_subgraphs_example-2.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\">Circular subgraphs that are structurally isomorphic are 
further distinguished according to their inherited atom and bond features, i.e. two structurally isomorphic circular subgraphs with distinct atom or bond features correspond to different components of the ECFP. For chemical bonds, this distinction is made on the basis of simple bond types: single, double, triple, or aromatic. To distinguish atoms, standard ECFPs use seven features based on the Daylight atomic invariants [2], but other versions of ECFPs that use pharmacophoric atom features also exist [1]. Optionally, the algorithm also allows for the stereochemical distinction between atoms with respect to tetrahedral chirality.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Using RDKit, a SMILES string can be transformed into an ECFP in a straightforward manner via the following function:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># import packages\nimport numpy as np\nfrom rdkit.Chem import AllChem\n\n\n# define function that transforms SMILES strings into ECFPs\ndef ECFP_from_smiles(smiles,\n                     R = 2,\n                     L = 2**10,\n                     use_features = False,\n                     use_chirality = False):\n    \"\"\"\n    Inputs:\n    \n    - smiles ... SMILES string of input compound\n    - R ... maximum radius of circular substructures\n    - L ... fingerprint length\n    - use_features ... if false then use standard Daylight atom features, if true then use pharmacophoric atom features\n    - use_chirality ... if true then append tetrahedral chirality flags to atom features\n    \n    Outputs:\n    - np.array(feature_list) ... 
ECFP with length L and maximum radius R\n    \"\"\"\n    \n    molecule = AllChem.MolFromSmiles(smiles)\n    # MolFromSmiles returns None if the SMILES string cannot be parsed\n    if molecule is None:\n        raise ValueError(\"Could not parse SMILES string: \" + str(smiles))\n    feature_list = AllChem.GetMorganFingerprintAsBitVect(\n        molecule,\n        radius = R,\n        nBits = L,\n        useFeatures = use_features,\n        useChirality = use_chirality)\n    return np.array(feature_list)\n<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If the length L of the ECFP is chosen to be very large, then each of its components corresponds to exactly one circular subgraph with its atom and bond features. The associated bit is then set to 1 if and only if this circular substructure is present anywhere in the molecule; otherwise it is set to 0. However, as L becomes smaller, hash collisions start to occur that reduce the resolution of the ECFP: a fingerprint component can then become ambiguous and correspond to any one of several distinct circular substructures. Therefore, L should be chosen large enough to keep such collisions rare and preserve the expressivity of the ECFP. Common choices are L = 1024 or L = 2048.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the literature, ECFP featurisations with radius R are often written in the form ECFP2R, with 2R being interpreted as the maximum fingerprint diameter. For example, the frequently used 1024-bit ECFP4 featurisation describes an ECFP with maximum radius R = 2 and length L = 1024. ECFPs are simple yet powerful molecular featurisations, and I like to look at them as a non-differentiable type of message-passing graph neural network (GNN). Have fun using them for molecular machine learning!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[1] Rogers, David, and Mathew Hahn. 
&#8220;Extended-connectivity fingerprints.&#8221; <em>Journal of Chemical Information and Modeling<\/em> 50.5 (2010): 742-754.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[2] Weininger, David, Arthur Weininger, and Joseph L. Weininger. &#8220;SMILES. 2. Algorithm for generation of unique SMILES notation.&#8221; <em>Journal of Chemical Information and Computer Sciences<\/em> 29.2 (1989): 97-101.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>After my posts on how to turn a SMILES string into a molecular graph and how to turn a SMILES string into a vector of molecular descriptors I now complete this series by illustrating how to turn the SMILES string of a molecular compound into an extended-connectivity fingerprint (ECFP). ECFPs were originally described in a [&hellip;]<\/p>\n","protected":false},"author":84,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[633,187,29,361,189,227,221,201],"tags":[682,680,681,438,152,129,469],"ppma_author":[556],"class_list":["post-8946","post","type-post","status-publish","format-standard","hentry","category-ai","category-cheminformatics","category-code","category-data-science","category-machine-learning","category-python-code","category-python","category-small-molecules","tag-ecfp","tag-extended-connectivity-fingerprints","tag-molecular-featurisation","tag-molecular-machine-learning","tag-python","tag-rdkit","tag-smiles"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":556,"user_id":84,"is_guest":0,"slug":"markusd","display_name":"Markus 
Dablander","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/d0047b5862940cb3a1b68dfa3f0735d6602b1e619fb299881b56cbf60d9fd8e1?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8946","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/84"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=8946"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8946\/revisions"}],"predecessor-version":[{"id":8961,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/8946\/revisions\/8961"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=8946"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=8946"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=8946"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=8946"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}