{"id":4294,"date":"2018-08-29T22:53:38","date_gmt":"2018-08-29T21:53:38","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=4294"},"modified":"2018-08-30T01:26:46","modified_gmt":"2018-08-30T00:26:46","slug":"mol2vec-finding-chemical-meaning-in-300-dimensions","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2018\/08\/mol2vec-finding-chemical-meaning-in-300-dimensions\/","title":{"rendered":"Mol2vec: Finding Chemical Meaning in 300 Dimensions"},"content":{"rendered":"<p><div id=\"attachment_4297\" style=\"width: 310px\" class=\"wp-caption alignright\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2018\/08\/ci-2017-00616a_0005.jpeg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" aria-describedby=\"caption-attachment-4297\" loading=\"lazy\" class=\"wp-image-4297 size-medium\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2018\/08\/ci-2017-00616a_0005.jpeg?resize=300%2C201&#038;ssl=1\" alt=\"Embeddings of Amino Acids\" width=\"300\" height=\"201\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2018\/08\/ci-2017-00616a_0005.jpeg?resize=300%2C201&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2018\/08\/ci-2017-00616a_0005.jpeg?resize=768%2C515&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2018\/08\/ci-2017-00616a_0005.jpeg?resize=1024%2C687&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2018\/08\/ci-2017-00616a_0005.jpeg?resize=624%2C419&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2018\/08\/ci-2017-00616a_0005.jpeg?w=1677&amp;ssl=1 1677w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2018\/08\/ci-2017-00616a_0005.jpeg?w=1250&amp;ssl=1 1250w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-4297\" class=\"wp-caption-text\">2D projections (t-SNE) of Mol2vec vectors of amino 
acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities, while their magnitudes reflect importance, i.e. how meaningful the corresponding substructure &#8220;words&#8221; are. [Figure from Ref. 1]<\/p><\/div><a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">Natural Language Processing (NLP)<\/a> algorithms are usually used for analyzing human communication, often in the form of textual information such as scientific papers and Tweets. One aspect, coming up with a representation that clusters words with similar meanings, has been achieved very successfully with the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Word2vec\">word2vec<\/a> approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences \u2014 the so-called <em>corpus<\/em> \u2014 to generate &#8220;embeddings&#8221; of the constituent words into a high-dimensional space. 
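To make the idea of such an embedding space concrete, here is a minimal, dependency-free sketch. The 3-D vectors below are hand-picked stand-ins for illustration only (real word2vec embeddings are learned from a corpus and typically have 100&ndash;300 dimensions); closeness is measured by cosine similarity.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy, hand-picked 3-D "embeddings" -- purely illustrative, NOT real
# word2vec output. Related words get similar coordinates by construction.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.0],
    "woman": [0.5, 0.1, 0.9],
    "apple": [-0.1, 0.0, 0.1],
}

# Semantically related words lie closer together than unrelated ones:
print(cosine(embeddings["king"], embeddings["queen"]))  # relatively high
print(cosine(embeddings["king"], embeddings["apple"]))  # near zero

# Analogy arithmetic: man + (queen - woman) lands nearest to king.
analogy = [m + (q - w) for m, q, w in zip(embeddings["man"],
                                          embeddings["queen"],
                                          embeddings["woman"])]
best = max(embeddings, key=lambda word: cosine(embeddings[word], analogy))
print(best)  # -> king
```

In practice these vectors are not set by hand but learned by training (e.g. with gensim's word2vec implementation, as used in the mol2vec paper); the sketch only shows why vector arithmetic on the resulting space can recover analogies.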
By computing the vector from &#8220;woman&#8221; to &#8220;queen&#8221; and adding it to the position of &#8220;man&#8221; in this high-dimensional space, the answer, &#8220;king&#8221;, can be found.<\/p>\n<p>A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed &#8220;<a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/acs.jcim.7b00616\">mol2vec<\/a>&#8221;.<sup>1<\/sup> They also released a Python implementation, available on <a href=\"https:\/\/github.com\/samoturk\/mol2vec\">Samo Turk&#8217;s GitHub repository<\/a>.<\/p>\n<p><!--more-->The paper describes how they assembled a corpus of 19.9 million molecules from <a href=\"https:\/\/www.ebi.ac.uk\/chembl\/downloads\">ChEMBL<\/a> version 23 and <a href=\"http:\/\/zinc15.docking.org\">ZINC 15<\/a>, two databases of small molecules, one storing bioactivity data and the other commercially available compounds; converted them into canonical SMILES strings using <a href=\"http:\/\/rdkit.org\">RDKit<\/a> 2017.03.3; and extracted the substructures that contribute to Morgan fingerprints with a radius of one. Using the gensim implementation of word2vec, they explored both CBOW (Continuous Bag Of Words) and Skip-gram models, as well as various window sizes (CBOW: w=5 &amp; 10; Skip-gram: w=10 &amp; 20) and embedding dimensionalities (100D and 300D); the best embedding was obtained with Skip-gram, w=10, and 300D.<\/p>\n<p>Mol2vec vectors were also combined with ProtVec<sup>2<\/sup> vectors to explore the influence of adding information about the protein (when known) to models built using chemical compound features. 
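As the figure caption above notes, a molecule's Mol2vec vector is simply the sum of the vectors of its Morgan-substructure &#8220;words&#8221;, and compound and protein descriptors can then be placed side by side in one feature vector. Below is a minimal, dependency-free sketch of both steps; the 4-D vectors and the numeric substructure identifiers are made up for illustration (real Mol2vec and ProtVec embeddings are learned, 100&ndash;300-dimensional vectors, and real Morgan identifiers come from RDKit).

```python
# Hypothetical embeddings of Morgan substructure identifiers ("words").
# The identifiers and coordinates are invented for this sketch.
substructure_vectors = {
    "847433064":  [0.12, -0.40, 0.33, 0.05],
    "2246728737": [-0.08, 0.21, -0.15, 0.30],
    "3537119515": [0.44, 0.02, -0.27, -0.11],
}

def mol_vector(substructure_ids):
    """Sum the vectors of a molecule's substructure words (Mol2vec-style)."""
    dims = len(next(iter(substructure_vectors.values())))
    total = [0.0] * dims
    for sid in substructure_ids:
        total = [t + v for t, v in zip(total, substructure_vectors[sid])]
    return total

# A toy "molecule" made of three substructure words:
compound_vec = mol_vector(["847433064", "2246728737", "3537119515"])

# Hypothetical ProtVec-style embedding of the target protein:
protein_vec = [0.25, -0.10, 0.07, 0.40]

# Combined compound-protein feature vector: the two blocks side by side.
combined_features = compound_vec + protein_vec
print(len(combined_features))  # 8
```

In the real pipeline the substructure identifiers would be generated by RDKit's Morgan algorithm and the vectors looked up in the trained gensim model; the concatenated features would then be fed to a downstream regressor or classifier.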
Such models are sometimes referred to as <em>proteochemometric models<\/em> (PCM).<\/p>\n<p>Using a rigorous four-level cross-validation scheme, which they referred to as CV1, CV2, CV3, and CV4, they investigated the influence of knowledge about compounds, target proteins, and both, on the accuracy of their PCM models:<\/p>\n<ul>\n<li>CV1: tests performance on unknown compound-target pairs;\n<ul>\n<li>the easiest scenario, because individual compounds and\/or targets (kinases) might already have been present in the training data.<\/li>\n<\/ul>\n<\/li>\n<li>CV2: tests performance on new targets by leaving targets out of the training data;<\/li>\n<li>CV3: tests performance on unknown compounds;<\/li>\n<li>CV4: tests performance on unknown compounds and new targets.<\/li>\n<\/ul>\n<p>Validation studies were carried out by generating models using Random Forests (RF), Gradient Boosted Machines (GBM), and Deep Neural Networks (DNN), in both regression and classification tasks:<\/p>\n<ul>\n<li>Solubility, Mutagenicity, Toxicity, and Kinase Specificity\n<ul>\n<li><em>ESOL<\/em>: predict aqueous solubility of 1,144 compounds;<\/li>\n<li><em>Ames Mutagenicity Dataset<\/em>: 3,481 mutagens + 2,990 non-mutagens;<\/li>\n<li><em>Tox21<\/em> Dataset: 12 targets associated with human toxicity, 8,192 compounds;<\/li>\n<li><em>Kinase Dataset<\/em>: IC<sub>50<\/sub>, <i>K<\/i><sub>d<\/sub>, and <i>K<\/i><sub>i<\/sub> ChEMBL bioassays with confidence \u2265 8, converted to pIC<sub>50<\/sub> values and thresholded at 6.3.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Using <i>mol2vec<\/i> embeddings as compound features, the following models were built:<\/p>\n<ul>\n<li>Random Forest (RF) from sklearn, max. features = \u221a(number of features); balanced class weights<\/li>\n<li>Gradient Boosting Machine (GBM) from XGBoost, 2000 estimators, max. 
depth of trees = 3, learning rate = 0.1\n<ul>\n<li>GBMClassifier: weight of positive samples adjusted to reflect <i>n<\/i><sub>actives<\/sub>\/<i>n<\/i><sub>inactives<\/sub><\/li>\n<\/ul>\n<\/li>\n<li>Deep Neural Networks (DNN)\n<ul>\n<li>DNN with binary FPs \u2014 MFP<i><sub>r<\/sub><\/i><sub>=2<\/sub>, 512 neurons \u2192 output layer: 1 neuron<\/li>\n<li>DNN with continuous <i>Mol2vec<\/i> vectors \u2014 4 hidden layers, each of 2000 neurons \u2192 1 neuron<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>All models were trained using 20 x 5-fold cross-validation and compared using Wilcoxon signed-rank tests. Regression tasks were evaluated using the coefficient of determination, <i>R<\/i><sup>2<\/sup><sub>cv<\/sub>; mean absolute error, MAE; and mean squared error, MSE. Classification tasks were assessed using the area under the Receiver Operating Characteristic curve, AUC; sensitivity, TPR; and specificity, TNR.<\/p>\n<p>While there were some instances where mol2vec performed better than other features, and scenarios where mol2vec RF models did better than Support Vector Machines (SVM), Na\u00efve Bayes Classifiers (NBC), Convolutional Neural Networks (CNN), and DNNs, overall they concluded that:<\/p>\n<ul>\n<li><i>Mol2vec<\/i> yields state-of-the-art performance for both classification and regression on a variety of datasets.<\/li>\n<li><i>Mol2vec<\/i> with GBM seems to be very suitable for regression tasks (<i>e.g.<\/i> ESOL).<\/li>\n<li><i>Mol2vec<\/i> with RF is recommended for classification tasks, including PCM models.<\/li>\n<li>Although several DNN 
architectures were evaluated, they were still outperformed by the tree-based methods (GBM and RF).\n<ul>\n<li>This suggests their DNN architectures and hyperparameters need further tuning.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Perhaps the greatest advantages of 300D mol2vec vectors over 2048-bit Morgan fingerprints are that they require less memory when training models, and are faster to train. They probably also require less training data, as evidenced by the significantly better performance in predicting the aqueous solubility of the 1,144 compounds in the ESOL set (an <i>R<\/i><sup>2<\/sup><sub>cv<\/sub> value of 0.86 for mol2vec-GBM versus 0.66 for Morgan FP-GBM models).<\/p>\n<p>To try mol2vec out, there are some great Jupyter Notebooks \u2014 and documentation \u2014 worth looking at:<\/p>\n<ul>\n<li><a href=\"https:\/\/github.com\/samoturk\/mol2vec_notebooks\/tree\/master\/Notebooks\">https:\/\/github.com\/samoturk\/mol2vec_notebooks\/tree\/master\/Notebooks<\/a><\/li>\n<li><a href=\"https:\/\/github.com\/samoturk\/mol2vec\/tree\/master\/examples\">https:\/\/github.com\/samoturk\/mol2vec\/tree\/master\/examples<\/a><\/li>\n<li><a href=\"https:\/\/mol2vec.readthedocs.io\/en\/latest\/index.html?highlight=features\">https:\/\/mol2vec.readthedocs.io\/en\/latest\/index.html?highlight=features<\/a><\/li>\n<\/ul>\n<h1>References<\/h1>\n<p>1. <a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/acs.jcim.7b00616\">Jaeger, S.; Fulle, S.; Turk, S., Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. <i>J Chem Inf Model<\/i> <b>2018<\/b>, 58, 27-35.<\/a><\/p>\n<p>2. <a href=\"https:\/\/journals.plos.org\/plosone\/article?id=10.1371\/journal.pone.0141287\">Asgari, E.; Mofrad, M. R., Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. 
<i>PLoS One <\/i><b>2015<\/b>, 10, e0141287<\/a>.<\/p>\n","protected":false},"author":35,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[187,10,138,189,202,201],"tags":[24,154,57,152,129],"ppma_author":[488],"class_list":["post-4294","post","type-post","status-publish","format-standard","hentry","category-cheminformatics","category-groupmeetings","category-journal-club","category-machine-learning","category-proteins","category-small-molecules","tag-bioinformatics","tag-jupyter","tag-protein","tag-python","tag-rdkit"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":488,"user_id":35,"is_guest":0,"slug":"garrett","display_name":"Garrett","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/df625261419c37dd5c5937e37f17a732626acd6eea1e6fabd03d935c25b453bf?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4294","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/pos
t"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=4294"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4294\/revisions"}],"predecessor-version":[{"id":4300,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/4294\/revisions\/4300"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=4294"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=4294"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=4294"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=4294"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}