Mol2vec: Finding Chemical Meaning in 300 Dimensions

Embeddings of Amino Acids

2D projections (t-SNE) of Mol2vec vectors of amino acids (bold arrows). These vectors were obtained by summing the vectors of the Morgan substructures (small arrows) present in the respective molecules (amino acids in the present example). The directions of the vectors provide a visual representation of similarities, while their magnitudes reflect importance, i.e. longer vectors correspond to more “meaningful” words (substructures). [Figure from Ref. 1]

Natural Language Processing (NLP) algorithms are usually used for analyzing human communication, often in the form of textual information such as scientific papers and Tweets. One aspect, coming up with a representation that clusters words with similar meanings, has been achieved very successfully with the word2vec approach. This involves training a shallow, two-layer artificial neural network on a very large body of words and sentences — the so-called corpus — to generate “embeddings” of the constituent words into a high-dimensional space. By computing the vector from “woman” to “queen”, and adding it to the position of “man” in this high-dimensional space, the answer, “king”, can be found.
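
As a minimal illustration of that analogy arithmetic, here is a sketch using gensim; the pretrained “word2vec-google-news-300” vectors are my own choice for illustration (and a sizeable download), not anything used in the mol2vec paper:

```python
# A minimal sketch of word2vec analogy arithmetic with gensim.
# The pretrained "word2vec-google-news-300" vectors are an illustrative
# choice (roughly a 1.7 GB download), not the corpus used in the paper.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # returns KeyedVectors

# man + (queen - woman) ~ king
print(wv.most_similar(positive=["queen", "man"], negative=["woman"], topn=1))
# e.g. [('king', 0.7...)]
```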

A recent publication by one of my former InhibOx colleagues, Simone Fulle, and her co-workers, Sabrina Jaeger and Samo Turk, shows how we can embed molecular substructures and chemical compounds into a similarly high-dimensional, continuous vectorial representation, which they dubbed “mol2vec”.1 They also released a Python implementation, available on Samo Turk’s GitHub repository.

The paper describes how they assembled a corpus of 19.9 million molecules from ChEMBL version 23 and ZINC 15, two databases of small molecules, one storing bioactivity data and the other commercially available compounds; converted them into canonical SMILES strings using RDKit 2017.03.3; and extracted the substructures that contribute to a Morgan fingerprint with a radius of one. Using the gensim implementation of word2vec, they explored both CBOW (Continuous Bag Of Words) and Skip-gram models, as well as various window sizes (CBOW: w = 5 & 10; Skip-gram: w = 10 & 20) and embedding dimensions (100D and 300D); the best embedding was obtained with Skip-gram, w = 10, and 300D.
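
A rough sketch of that pipeline is shown below, using RDKit’s bitInfo bookkeeping to build the substructure “sentences” and gensim to train a Skip-gram model; the three example SMILES and the sentence-ordering details are my own assumptions, not the authors’ code:

```python
# Sketch: build Morgan-substructure "sentences" and train a Skip-gram
# word2vec model on them. Illustrative only; the paper used a corpus of
# 19.9 million molecules and the authors' own sentence-generation code.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from gensim.models import Word2Vec

def mol_to_sentence(smiles, radius=1):
    """Return the Morgan substructure identifiers (as strings) for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    bit_info = {}  # identifier -> ((atom_index, radius), ...)
    AllChem.GetMorganFingerprint(mol, radius, bitInfo=bit_info)
    words = []
    for atom_idx in range(mol.GetNumAtoms()):      # order by atom, then by radius
        for r in range(radius + 1):
            for identifier, environments in bit_info.items():
                if (atom_idx, r) in environments:
                    words.append(str(identifier))
    return words

corpus = [mol_to_sentence(s) for s in ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]]

# Skip-gram (sg=1), window 10, 300 dimensions, i.e. the best-performing setup.
# gensim >= 4 calls the dimensionality "vector_size" (older versions: "size").
model = Word2Vec(corpus, vector_size=300, window=10, sg=1, min_count=1)

# A molecule's mol2vec embedding is the sum of its substructure vectors.
mol_vec = np.sum([model.wv[w] for w in corpus[0] if w in model.wv], axis=0)
print(mol_vec.shape)  # (300,)
```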

Mol2vec vectors were also combined with ProtVec2 vectors to explore the influence of adding information about the protein (when known) to models built using chemical compound features. Such models are sometimes referred to as proteochemometric models (PCM).
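
In its simplest form, such a combination is just the two embeddings joined into one descriptor per compound-target pair; here is a minimal sketch, where the concatenation and the 100D size of the ProtVec vector are my assumptions:

```python
# Sketch: a proteochemometric (PCM) descriptor as the concatenation of a
# compound's Mol2vec vector and its target's ProtVec vector. The random
# vectors are placeholders for the real embeddings.
import numpy as np

mol_vec = np.random.rand(300)   # 300D Mol2vec compound embedding (placeholder)
prot_vec = np.random.rand(100)  # 100D ProtVec protein embedding (placeholder)

pcm_descriptor = np.concatenate([mol_vec, prot_vec])
print(pcm_descriptor.shape)  # (400,)
```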

Using a rigorous four-level cross-validation scheme, which they referred to as CV1, CV2, CV3, and CV4, they investigated the influence of knowledge about compounds, target proteins, and both on the accuracy of their PCM models (a minimal group-split sketch follows the list):

  • CV1: tests performance on unknown compound-target pairs;
    • the easiest scenario, because individual compounds and/or targets (kinases) might already have been present in the training data.
  • CV2: tests performance on new targets, by leaving out targets from the training data.
  • CV3: tests performance on unknown compounds.
  • CV4: tests performance on unknown compounds and new targets.
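
As an illustration of what leaving out whole targets or compounds looks like in code, here is a minimal sketch using scikit-learn’s GroupKFold; the toy DataFrame and column names are assumptions, and this is not the authors’ splitting code:

```python
# Sketch: group-wise splits that mimic "new targets" (CV2-like) and
# "new compounds" (CV3-like) using scikit-learn's GroupKFold.
# The toy DataFrame and column names are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import GroupKFold

pairs = pd.DataFrame({
    "compound_id": ["c1", "c1", "c2", "c3", "c3", "c4"],
    "target_id":   ["t1", "t2", "t1", "t2", "t3", "t3"],
    "pIC50":       [6.1, 5.2, 7.4, 6.8, 5.9, 6.5],
})

gkf = GroupKFold(n_splits=3)

# CV2-like: every target in the test fold is absent from the training fold.
for train_idx, test_idx in gkf.split(pairs, groups=pairs["target_id"]):
    assert not set(pairs["target_id"].iloc[train_idx]) & set(pairs["target_id"].iloc[test_idx])

# CV3-like: every compound in the test fold is absent from the training fold.
for train_idx, test_idx in gkf.split(pairs, groups=pairs["compound_id"]):
    assert not set(pairs["compound_id"].iloc[train_idx]) & set(pairs["compound_id"].iloc[test_idx])
```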

Validation studies were carried out by generating models using Random Forests (RF), Gradient Boosting Machines (GBM), and Deep Neural Networks (DNN), in both regression and classification tasks, on the following datasets:

  • Solubility, Mutagenicity, Toxicity, and Kinase Specificity
    • ESOL: predict aqueous solubility of 1,144 compounds;
    • Ames Mutagenicity Dataset: 3,481 mutagens + 2,990 non-mutagens;
    • Tox21 Dataset: 12 targets associated with human toxicity, 8,192 compounds;
    • Kinase Dataset: IC50, Kd, and Ki ChEMBL bioassays with confidence ≥ 8, converted to pIC50 values and thresholded at 6.3.

Using mol2vec embeddings as compound features, the following models were built (a configuration sketch follows the list):

  • Random Forest (RF) from sklearn, max. features = √(number of features); balanced class weights
  • Gradient Boosting Machine (GBM) from XGBoost, 2,000 estimators, max. depth of trees = 3, learning rate = 0.1
    • GBMClassifier: weight of positive samples adjusted to reflect n_actives/n_inactives
  • Deep Neural Networks (DNN)
    • DNN with binary FPs (Morgan FP, radius 2): 512 neurons → output layer of 1 neuron
    • DNN with continuous Mol2vec vectors: 4 hidden layers of 2,000 neurons each → output layer of 1 neuron
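
The two tree-based models above translate fairly directly into scikit-learn and XGBoost calls; the sketch below is a reconstruction from the listed hyperparameters, with placeholder data, and the direction of the positive-class weight is my assumption (the DNNs are omitted):

```python
# Sketch: RF and GBM classifiers configured with the hyperparameters listed
# above. X (Mol2vec features) and y (activity labels) are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X = np.random.rand(200, 300)           # placeholder: 200 compounds x 300D Mol2vec
y = np.random.randint(0, 2, size=200)  # placeholder: binary activity labels

# Random Forest: max_features = sqrt(n_features), balanced class weights.
rf = RandomForestClassifier(max_features="sqrt", class_weight="balanced")

# XGBoost GBM: 2,000 estimators, tree depth 3, learning rate 0.1.
# scale_pos_weight re-weights the positive class; n_inactives / n_actives
# balances the classes when actives are treated as the positive class.
n_actives = int(y.sum())
n_inactives = int(len(y) - n_actives)
gbm = XGBClassifier(
    n_estimators=2000,
    max_depth=3,
    learning_rate=0.1,
    scale_pos_weight=n_inactives / n_actives,
)

rf.fit(X, y)
gbm.fit(X, y)
```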

All models were trained using 20 × 5-fold cross-validation and compared using Wilcoxon signed-rank tests. Regression tasks were evaluated using the coefficient of determination (R²cv), the mean absolute error (MAE), and the mean squared error (MSE). Classification tasks were assessed using the area under the Receiver Operating Characteristic curve (AUC), sensitivity (TPR), and specificity (TNR).
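
These metrics and the paired test map onto standard scikit-learn and SciPy functions; here is a minimal sketch with placeholder predictions:

```python
# Sketch: the evaluation metrics and the paired Wilcoxon signed-rank test
# named above, with placeholder predictions.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, recall_score, roc_auc_score)

# Regression metrics (e.g. ESOL solubility).
y_true = np.array([0.5, 1.2, -0.3, 2.0])
y_pred = np.array([0.4, 1.0, 0.1, 1.8])
print(r2_score(y_true, y_pred), mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred))

# Classification metrics (e.g. Ames mutagenicity).
y_cls = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.6])
y_hat = (y_score > 0.5).astype(int)
print(roc_auc_score(y_cls, y_score),
      recall_score(y_cls, y_hat),               # sensitivity (TPR)
      recall_score(y_cls, y_hat, pos_label=0))  # specificity (TNR)

# Paired comparison of two models' per-fold AUCs (20 x 5-fold gives 100 values).
auc_model_a = 0.75 + 0.2 * np.random.rand(100)
auc_model_b = 0.70 + 0.2 * np.random.rand(100)
print(wilcoxon(auc_model_a, auc_model_b))
```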

There were some instances where mol2vec performed better than other features, and scenarios where mol2vec RF models did better than Support Vector Machines (SVM), Naïve Bayes Classifiers (NBC), Convolutional Neural Networks (CNN), and DNNs. Overall, they concluded that:

  • Mol2vec yields state-of-the-art performance for both classification and regression on a variety of datasets.
  • Mol2vec with GBM seems to be very suitable for regression tasks (e.g. ESOL solubility).
  • Mol2vec with RF is recommended for classification tasks, including PCM models.
  • Although several DNN architectures were evaluated, they were still outperformed by tree-based methods (GBM and RF). 
    • Their DNNs need tuning.

Perhaps the greatest advantages of 300D mol2vec vectors over 2048-bit Morgan fingerprints are that models built on them require less memory and are faster to train. They probably also require less training data, as evidenced by the significantly better performance in predicting the aqueous solubility of the 1,144 compounds in the ESOL set (R²cv of 0.86 for the mol2vec-GBM model, versus 0.66 for the Morgan FP-GBM model).

To try mol2vec out, there are some great Jupyter Notebooks and documentation worth looking at in the GitHub repository mentioned above.
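
If you just want to featurize molecules with the released package and its pretrained model, the usage is roughly as follows; the function names (mol2alt_sentence, MolSentence, sentences2vec) and the model filename are taken from my reading of the repository’s examples, so treat them as assumptions and check the notebooks for the current API:

```python
# Sketch: featurizing molecules with the released mol2vec package and its
# pretrained 300D model. Function names and the model filename are assumed
# from the repository's examples and may change; check the notebooks there.
from gensim.models import word2vec
from rdkit import Chem
from mol2vec.features import MolSentence, mol2alt_sentence, sentences2vec

# Pretrained model from the repository's examples (adjust the path as needed).
model = word2vec.Word2Vec.load("model_300dim.pkl")

mols = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1O"]]
sentences = [MolSentence(mol2alt_sentence(m, 1)) for m in mols]  # radius-1 "sentences"
vectors = sentences2vec(sentences, model, unseen="UNK")          # one 300D vector per molecule
print(len(vectors), vectors[0].shape)  # 2 (300,)
```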

References

1. Jaeger, S.; Fulle, S.; Turk, S., Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J Chem Inf Model 2018, 58, 27-35.

2. Asgari, E.; Mofrad, M. R., Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 2015, 10, e0141287.
