{"id":6935,"date":"2021-06-15T17:09:05","date_gmt":"2021-06-15T16:09:05","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=6935"},"modified":"2021-06-16T12:21:48","modified_gmt":"2021-06-16T11:21:48","slug":"out-of-distribution-generalisation-and-scaffold-splitting-in-molecular-property-prediction","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2021\/06\/out-of-distribution-generalisation-and-scaffold-splitting-in-molecular-property-prediction\/","title":{"rendered":"Out-of-distribution generalisation and scaffold splitting in molecular property prediction"},"content":{"rendered":"\n<p>The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. This capability to effectively <strong>generalise<\/strong> is amongst the most desirable properties a prediction model (or a mind, for that matter) can have.<\/p>\n\n\n\n<p>In supervised machine learning, the standard way to evaluate the generalisation power of a prediction model for a given task is to randomly split the whole available data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> into two sets \u2013 a training set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and a test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/>. 
The model is then trained on the examples in the training set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and its predictive abilities are afterwards measured on the untouched examples in the test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> via a suitable performance metric.<\/p>\n\n\n\n<p>Since in this scenario the model has never seen any of the examples in <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> during training, its performance on <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> must be indicative of its performance on novel data <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Bnew%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{new}}\" class=\"latex\" \/> which it will encounter in the future. 
Right?<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>No.<\/p>\n\n\n\n<p>In practice, one regularly observes that a machine learning model which performs well on a randomly selected test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> fails spectacularly when confronted with novel data <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Bnew%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{new}}\" class=\"latex\" \/> collected at a later point in time, by a different lab, in a different environment, or in some other context that differs from the one in which the initial data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> was collected. 
The reason for this can be found in the <strong>distributional shift<\/strong> between <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Bnew%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{new}}\" class=\"latex\" \/> which frequently occurs when the data collection context (and thus the data generating process) is altered in some way.<\/p>\n\n\n\n<p>If the data split for the initial data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> into training set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> is done uniformly at random (as is usual), then both <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> follow the same distribution. 
This random uniform data split is very much in accordance with the framework of classical statistical learning theory [1], where one assumes that all training and test examples have been sampled independently from the same underlying probability distribution.<\/p>\n\n\n\n<p>Unfortunately, a random uniform data split is rarely a good simulation of practical reality: a newly collected data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Bnew%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{new}}\" class=\"latex\" \/> that is fed into a machine learning model to obtain predictions almost never follows the distribution of the data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> on which the model was originally trained. 
This distributional shift between the initial training data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and the newly collected data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Bnew%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{new}}\" class=\"latex\" \/> normally leads to a substantial drop in performance of the model on <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Bnew%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{new}}\" class=\"latex\" \/> compared to its performance on a test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> which follows the same distribution as <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/>. 
Thus, splitting the initial data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> uniformly at random into a test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> and a training set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> often leads to overoptimistic results when trying to estimate the predictive abilities of a machine learning model in a practical setting.<\/p>\n\n\n\n<p>To get a more reliable picture of the real-world predictive capabilities of a trained machine learning model one must find a way to model a meaningful distributional shift and build it into the test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/>. Evaluating the model on <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> can then provide a measure for the <strong>out-of-distribution generalisation abilities<\/strong> of the model.<\/p>\n\n\n\n<p>Measuring out-of-distribution generalisation is of particular relevance in the field of <strong>molecular property prediction<\/strong> where distributional shifts tend to be large and difficult to handle for machine learning models. 
Different molecular data sets obtained by distinct pharmaceutical companies and research groups often contain compounds from vastly different areas of chemical space that exhibit high structural heterogeneity. An elegant solution for the modelling of such distributional shifts in chemical space is given by the idea of <strong>scaffold splitting<\/strong>.<\/p>\n\n\n\n<p>The notion of a (two-dimensional) molecular scaffold is described in the article by Bemis and Murcko [2]. A molecular scaffold reduces the chemical structure of a compound to its core components, essentially by removing all side chains and only keeping ring systems and parts which link together ring systems. An additional option for making molecular scaffolds even more general is to \u201cforget\u201d the identities of the bonds and atoms by replacing all atoms with carbons and all bonds with single bonds. <\/p>\n\n\n\n<p>Bemis-Murcko scaffolds can be automatically generated in RDKit via the following Python code:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"enlighter\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\"># how to extract the Bemis-Murcko scaffold of a molecular compound via RDKit\n\n# import packages\nfrom rdkit import Chem\nfrom rdkit.Chem.Scaffolds import MurckoScaffold\n\n# define compound via its SMILES string\nsmiles = \"CN1CCCCC1CCN2C3=CC=CC=C3SC4=C2C=C(C=C4)SC\"\n\n# convert SMILES string to RDKit mol object \nmol = Chem.MolFromSmiles(smiles)\n\n# create RDKit mol object corresponding to Bemis-Murcko scaffold of original compound\nmol_scaffold = MurckoScaffold.GetScaffoldForMol(mol)\n\n# make the scaffold generic by replacing all atoms with carbons and all bonds with single bonds\nmol_scaffold_generic = MurckoScaffold.MakeScaffoldGeneric(mol_scaffold)\n\n# convert the generic scaffold mol object back to a SMILES string 
format\nsmiles_scaffold_generic = Chem.CanonSmiles(Chem.MolToSmiles(mol_scaffold_generic))\n\n# display compound and its generic Bemis-Murcko scaffold\ndisplay(mol)\nprint(smiles)\ndisplay(mol_scaffold_generic)\nprint(smiles_scaffold_generic)<\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"315\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?resize=625%2C315&#038;ssl=1\" alt=\"\" class=\"wp-image-6941\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?resize=1024%2C516&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?resize=300%2C151&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?resize=768%2C387&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?resize=1536%2C774&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?resize=2048%2C1032&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?resize=624%2C314&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2021\/06\/mol_and_scaffold-4.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p>If we now have a molecular data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/>, 
we can map each compound in <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> to its respective scaffold. Let us assume that a total number of <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"s\" class=\"latex\" \/> pairwise distinct scaffolds appear in <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> and that these scaffolds are numbered consecutively from <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=1&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"1\" class=\"latex\" \/> to <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"s\" class=\"latex\" \/>. We can then define an <strong>equivalence relation<\/strong> on <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> by calling two compounds equivalent if they share the same scaffold. 
The associated equivalence classes consist of compound sets <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B1%7D%2C+...%2C+X_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{1}, ..., X_s\" class=\"latex\" \/> whereby a given set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7Bk%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{k}\" class=\"latex\" \/> contains all compounds in <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> which share the <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=k&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"k\" class=\"latex\" \/>-th scaffold. It is not hard to see that the sets <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B1%7D%2C+...%2C+X_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{1}, ..., X_s\" class=\"latex\" \/> form a partition of the original data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/>. Without loss of generality, we assume that the equivalence classes <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_1%2C+...%2C+X_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_1, ..., X_s\" class=\"latex\" \/> are ordered by size in descending order, i.e. 
we assume that <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B1%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{1}\" class=\"latex\" \/> contains at least as many molecules as <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B2%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{2}\" class=\"latex\" \/>, and so on.<\/p>\n\n\n\n<p>One appropriate way to now produce a scaffold split of the molecular data set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/> into a training set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and a test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> for machine learning is to define <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> as the union of the first (larger) sets <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_1%2C+...%2C+X_c&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_1, ..., X_c\" class=\"latex\" \/> and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> as the union of the last (smaller) sets <img decoding=\"async\" loading=\"lazy\" 
src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7Bc%2B1%7D%2C...%2CX_s&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{c+1},...,X_s\" class=\"latex\" \/>. Here <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=c&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"c\" class=\"latex\" \/> is a custom index parameter which can be used to control the respective sizes of <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/>; frequently <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=c&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"c\" class=\"latex\" \/> is chosen such that <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> contains approximately <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=80+%5C%25&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"80 &#92;%\" class=\"latex\" \/> of the examples in <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X\" class=\"latex\" \/>. <\/p>\n\n\n\n<p>While a scaffold split is certainly not perfect, it is already a lot better than a uniform random split at providing a relevant measure of the practical utility of a molecular property prediction model. 
It mimics a situation where the training set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> was sampled from a structurally different area of chemical space than the test set <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/>. This creates a distributional shift between <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btrain%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{train}}\" class=\"latex\" \/> and <img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/s0.wp.com\/latex.php?latex=X_%7B%5Ctext%7Btest%7D%7D&#038;bg=ffffff&#038;fg=000&#038;s=0&#038;c=20201002\" alt=\"X_{&#92;text{test}}\" class=\"latex\" \/> which is comparable to the distributional shifts commonly observed in real chemical data sets. Evaluating a molecular machine learning model using a scaffold split rather than a uniform random split thus yields a significantly more realistic estimate of its real-world predictive performance.<\/p>\n\n\n\n<p><strong>References:<\/strong><\/p>\n\n\n\n<p>[1] Cucker, Felipe, and Steve Smale. &#8220;On the mathematical foundations of learning.&#8221; <em>Bulletin of the American Mathematical Society<\/em> 39.1 (2002): 1-49.<\/p>\n\n\n\n<p>[2] Bemis, Guy W., and Mark A. Murcko. &#8220;The properties of known drugs. 1. Molecular frameworks.&#8221; <em>Journal of Medicinal Chemistry<\/em> 39.15 (1996): 2887-2893.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The ability to successfully apply previously acquired knowledge to novel and unfamiliar situations is one of the main hallmarks of successful learning and general intelligence. 
This capability to effectively generalise is amongst the most desirable properties a prediction model (or a mind, for that matter) can have. In supervised machine learning, the standard way to [&hellip;]<\/p>\n","protected":false},"author":84,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[187,29,189,221,227,201,15],"tags":[130,87,172,152,134],"ppma_author":[556],"class_list":["post-6935","post","type-post","status-publish","format-standard","hentry","category-cheminformatics","category-code","category-machine-learning","category-python","category-python-code","category-small-molecules","category-technical","tag-cheminformatics","tag-code-2","tag-machine-learning","tag-python","tag-small-molecules"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":556,"user_id":84,"is_guest":0,"slug":"markusd","display_name":"Markus 
Dablander","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/d0047b5862940cb3a1b68dfa3f0735d6602b1e619fb299881b56cbf60d9fd8e1?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/6935","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/84"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=6935"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/6935\/revisions"}],"predecessor-version":[{"id":6969,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/6935\/revisions\/6969"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=6935"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=6935"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=6935"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=6935"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}