{"id":7803,"date":"2022-02-16T18:05:31","date_gmt":"2022-02-16T18:05:31","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=7803"},"modified":"2022-02-21T14:14:28","modified_gmt":"2022-02-21T14:14:28","slug":"how-to-turn-a-smiles-string-into-a-molecular-graph-for-pytorch-geometric","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2022\/02\/how-to-turn-a-smiles-string-into-a-molecular-graph-for-pytorch-geometric\/","title":{"rendered":"How to turn a SMILES string into a molecular graph for Pytorch Geometric"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">Despite some of their technical <a href=\"https:\/\/www.blopig.com\/blog\/2021\/10\/issues-with-graph-neural-networks-the-cracks-are-where-the-light-shines-through\/\">issues<\/a>, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor vectors. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But molecular data usually comes in the sequential form of labeled SMILES strings. It is not obvious for beginners how to optimally transform a SMILES string into a structured molecular graph object that can be used as an input for a GNN. In this post, we show how to convert a SMILES string into a molecular graph object which can subsequently be used for graph-based machine learning. We do so within the framework of <strong>Pytorch Geometric<\/strong> which currently is one of the best and most commonly used Python-based GNN-libraries.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We divide our task into three high-level steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>We define a function that maps an RDKit atom object to a suitable atom feature vector.<\/li><li>We define a function that maps an RDKit bond object to a suitable bond feature vector.<\/li><li>We define a function that takes as its input a list of SMILES strings and associated labels and then uses the functions from 1.) and 2.) to create a list of labeled Pytorch Geometric graph objects as its output.<\/li><\/ol>\n\n\n\n<!--more-->\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 0: Import Packages<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As always, we first import the necessary Python packages for our endeavour:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># import packages\n\n# general tools\nimport numpy as np\n\n# RDkit\nfrom rdkit import Chem\nfrom rdkit.Chem.rdmolops import GetAdjacencyMatrix\n\n# Pytorch and Pytorch Geometric\nimport torch\nfrom torch_geometric.data import Data\nfrom torch.utils.data import DataLoader<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 1: Atom Featurisation<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We start by defining an auxiliary function which transforms a value x into a one-hot encoding based on a list of permitted values for x:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def one_hot_encoding(x, permitted_list):\n    \"\"\"\n    Maps input elements x which are not in the permitted list to the last element\n    of the permitted list.\n    \"\"\"\n\n    if x not in permitted_list:\n        x = permitted_list[-1]\n\n    binary_encoding = [int(boolean_value) for boolean_value in list(map(lambda s: x == s, permitted_list))]\n\n    return binary_encoding<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now we use this auxiliary function to define the actual atom featurisation function:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def get_atom_features(atom, \n                      use_chirality = True, \n                      hydrogens_implicit = True):\n    \"\"\"\n    Takes an RDKit atom object as input and gives a 1d-numpy array of atom features as output.\n    \"\"\"\n\n    # define list of permitted atoms\n    \n    permitted_list_of_atoms =  ['C','N','O','S','F','Si','P','Cl','Br','Mg','Na','Ca','Fe','As','Al','I', 'B','V','K','Tl','Yb','Sb','Sn','Ag','Pd','Co','Se','Ti','Zn', 'Li','Ge','Cu','Au','Ni','Cd','In','Mn','Zr','Cr','Pt','Hg','Pb','Unknown']\n    \n    if hydrogens_implicit == False:\n        permitted_list_of_atoms = ['H'] + permitted_list_of_atoms\n    \n    # compute atom features\n    \n    atom_type_enc = one_hot_encoding(str(atom.GetSymbol()), permitted_list_of_atoms)\n    \n    n_heavy_neighbors_enc = one_hot_encoding(int(atom.GetDegree()), [0, 1, 2, 3, 4, \"MoreThanFour\"])\n    \n    formal_charge_enc = one_hot_encoding(int(atom.GetFormalCharge()), [-3, -2, -1, 0, 1, 2, 3, \"Extreme\"])\n    \n    hybridisation_type_enc = one_hot_encoding(str(atom.GetHybridization()), [\"S\", \"SP\", \"SP2\", \"SP3\", \"SP3D\", \"SP3D2\", \"OTHER\"])\n    \n    is_in_a_ring_enc = [int(atom.IsInRing())]\n    \n    is_aromatic_enc = [int(atom.GetIsAromatic())]\n    \n    atomic_mass_scaled = [float((atom.GetMass() - 10.812)\/116.092)]\n    \n    vdw_radius_scaled = [float((Chem.GetPeriodicTable().GetRvdw(atom.GetAtomicNum()) - 1.5)\/0.6)]\n    \n    covalent_radius_scaled = [float((Chem.GetPeriodicTable().GetRcovalent(atom.GetAtomicNum()) - 0.64)\/0.76)]\n\n    atom_feature_vector = atom_type_enc + n_heavy_neighbors_enc + formal_charge_enc + hybridisation_type_enc + is_in_a_ring_enc + is_aromatic_enc + atomic_mass_scaled + vdw_radius_scaled + covalent_radius_scaled\n                                    \n    if use_chirality == True:\n        chirality_type_enc = one_hot_encoding(str(atom.GetChiralTag()), [\"CHI_UNSPECIFIED\", \"CHI_TETRAHEDRAL_CW\", \"CHI_TETRAHEDRAL_CCW\", \"CHI_OTHER\"])\n        atom_feature_vector += chirality_type_enc\n    \n    if hydrogens_implicit == True:\n        n_hydrogens_enc = one_hot_encoding(int(atom.GetTotalNumHs()), [0, 1, 2, 3, 4, \"MoreThanFour\"])\n        atom_feature_vector += n_hydrogens_enc\n\n    return np.array(atom_feature_vector)<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To encapsulate as much information as possible within the molecular graph, we include a plethora of atomic features: atom type, number of heavy atom neighbours, formal charge, hybridisation type, whether the atom is in a ring, whether the atom is aromatic, atomic mass, Van der Waals radius, and covalent radius. The last three properties are numerical in nature and are thus automatically scaled to a reasonable range using empirically estimated quantities. The user can explicitly specify whether to use chirality as a stereochemical feature and whether to treat hydrogen atoms implicitly or explicitly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 2: Bond Featurisation<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now that we have constructed a function to conveniently turn RDKit atom objects into feature vectors, we define an analogous function for RDKit bond objects:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def get_bond_features(bond, \n                      use_stereochemistry = True):\n    \"\"\"\n    Takes an RDKit bond object as input and gives a 1d-numpy array of bond features as output.\n    \"\"\"\n\n    permitted_list_of_bond_types = [Chem.rdchem.BondType.SINGLE, Chem.rdchem.BondType.DOUBLE, Chem.rdchem.BondType.TRIPLE, Chem.rdchem.BondType.AROMATIC]\n\n    bond_type_enc = one_hot_encoding(bond.GetBondType(), permitted_list_of_bond_types)\n    \n    bond_is_conj_enc = [int(bond.GetIsConjugated())]\n    \n    bond_is_in_ring_enc = [int(bond.IsInRing())]\n    \n    bond_feature_vector = bond_type_enc + bond_is_conj_enc + bond_is_in_ring_enc\n    \n    if use_stereochemistry == True:\n        stereo_type_enc = one_hot_encoding(str(bond.GetStereo()), [\"STEREOZ\", \"STEREOE\", \"STEREOANY\", \"STEREONONE\"])\n        bond_feature_vector += stereo_type_enc\n\n    return np.array(bond_feature_vector)<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The bond features we consider in the above function are: bond type, whether the bond is conjugated, and whether the bond is in a ring. As an additional option, the user can specify whether to include E-Z stereochemical features around double bonds. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 3: Generating labeled Pytorch Geometric Graph Objects<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Equipped with suitable functions to turn RDKit atom objects and RDKit bond objects into informative feature vectors, we swiftly move on to define a function which turns a list of SMILES strings and an associated list of labels (such as pKi values) into a list of Pytorch Geometric graph objects:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">def create_pytorch_geometric_graph_data_list_from_smiles_and_labels(x_smiles, y):\n    \"\"\"\n    Inputs:\n    \n    x_smiles = [smiles_1, smiles_2, ....] ... a list of SMILES strings\n    y = [y_1, y_2, ...] ... a list of numerial labels for the SMILES strings (such as associated pKi values)\n    \n    Outputs:\n    \n    data_list = [G_1, G_2, ...] ... a list of torch_geometric.data.Data objects which represent labeled molecular graphs that can readily be used for machine learning\n    \n    \"\"\"\n    \n    data_list = []\n    \n    for (smiles, y_val) in zip(x_smiles, y):\n        \n        # convert SMILES to RDKit mol object\n        mol = Chem.MolFromSmiles(smiles)\n\n        # get feature dimensions\n        n_nodes = mol.GetNumAtoms()\n        n_edges = 2*mol.GetNumBonds()\n        unrelated_smiles = \"O=O\"\n        unrelated_mol = Chem.MolFromSmiles(unrelated_smiles)\n        n_node_features = len(get_atom_features(unrelated_mol.GetAtomWithIdx(0)))\n        n_edge_features = len(get_bond_features(unrelated_mol.GetBondBetweenAtoms(0,1)))\n\n        # construct node feature matrix X of shape (n_nodes, n_node_features)\n        X = np.zeros((n_nodes, n_node_features))\n\n        for atom in mol.GetAtoms():\n            X[atom.GetIdx(), :] = get_atom_features(atom)\n            \n        X = torch.tensor(X, dtype = torch.float)\n        \n        # construct edge index array E of shape (2, n_edges)\n        (rows, cols) = np.nonzero(GetAdjacencyMatrix(mol))\n        torch_rows = torch.from_numpy(rows.astype(np.int64)).to(torch.long)\n        torch_cols = torch.from_numpy(cols.astype(np.int64)).to(torch.long)\n        E = torch.stack([torch_rows, torch_cols], dim = 0)\n        \n        # construct edge feature array EF of shape (n_edges, n_edge_features)\n        EF = np.zeros((n_edges, n_edge_features))\n        \n        for (k, (i,j)) in enumerate(zip(rows, cols)):\n            \n            EF[k] = get_bond_features(mol.GetBondBetweenAtoms(int(i),int(j)))\n        \n        EF = torch.tensor(EF, dtype = torch.float)\n        \n        # construct label tensor\n        y_tensor = torch.tensor(np.array([y_val]), dtype = torch.float)\n        \n        # construct Pytorch Geometric data object and append to data list\n        data_list.append(Data(x = X, edge_index = E, edge_attr = EF, y = y_tensor))\n\n    return data_list<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Training Loop and Summary<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this post, we learned how to turn RDKit atoms and RDKit bonds into meaningful feature vectors and how to use these feature vectors to create molecular graph objects that can be used as inputs to Pytorch Geometric GNNs. A canonical GNN training loop could now look like this:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"generic\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># canonical training loop for a Pytorch Geometric GNN model gnn_model\n\n# create list of molecular graph objects from list of SMILES x_smiles and list of labels y\ndata_list = create_pytorch_geometric_graph_data_list_from_smiles_and_labels(x_smiles, y)\n\n# create dataloader for training\ndataloader = DataLoader(dataset = data_list, batch_size = 2**7)\n\n# define loss function\nloss_function = nn.MSELoss()\n\n# define optimiser\noptimiser = torch.optim.Adam(gnn_model.parameters(), lr = 1e-3)\n\n# loop over 10 training epochs\nfor epoch in range(10):\n\n    # set model to training mode\n    gnn_model.train()\n\n    # loop over minibatches for training\n    for (k, batch) in enumerate(dataloader):\n\n        # compute current value of loss function via forward pass\n        output = gnn_model(batch)\n        loss_function_value = loss_function(output[:,0], torch.tensor(batch.y, dtype = torch.float32))\n\n        # set past gradient to zero\n        optimiser.zero_grad()\n\n        # compute current gradient via backward pass\n        loss_function_value.backward()\n\n        # update model weights using gradient and optimisation method\n        optimiser.step()<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Happy training!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Despite some of their technical issues, graph neural networks (GNNs) are quickly being adopted as one of the state-of-the-art methods for molecular property prediction. The differentiable extraction of molecular features from low-level molecular graphs has become a viable (although not always superior) alternative to classical molecular representation techniques such as Morgan fingerprints and molecular descriptor [&hellip;]<\/p>\n","protected":false},"author":84,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[187,29,361,189,227,221,201],"tags":[431,360,172,152,471,129,134,469],"ppma_author":[556],"class_list":["post-7803","post","type-post","status-publish","format-standard","hentry","category-cheminformatics","category-code","category-data-science","category-machine-learning","category-python-code","category-python","category-small-molecules","tag-graph-neural-networks","tag-graphs","tag-machine-learning","tag-python","tag-pytorch","tag-rdkit","tag-small-molecules","tag-smiles"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":556,"user_id":84,"is_guest":0,"slug":"markusd","display_name":"Markus Dablander","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/d0047b5862940cb3a1b68dfa3f0735d6602b1e619fb299881b56cbf60d9fd8e1?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Dablander","first_name":"Markus","job_title":"","description":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/7803","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/84"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=7803"}],"version-history":[{"count":6,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/7803\/revisions"}],"predecessor-version":[{"id":8947,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/7803\/revisions\/8947"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=7803"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=7803"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=7803"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=7803"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}