{"id":11543,"date":"2024-08-15T18:45:56","date_gmt":"2024-08-15T17:45:56","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=11543"},"modified":"2024-08-18T11:52:09","modified_gmt":"2024-08-18T10:52:09","slug":"sort-and-slice-tutorial-an-alternative-to-extended-connectivity-fingerprints","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2024\/08\/sort-and-slice-tutorial-an-alternative-to-extended-connectivity-fingerprints\/","title":{"rendered":"Sort and Slice Tutorial &#8211; An alternative to extended connectivity fingerprints"},"content":{"rendered":"\n<main>\n\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h3 id=\"Background\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 18px;margin-bottom: 9px;font-size: 23px'>Background<a class=\"anchor-link\" href=\"#Background\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h3>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\"><a href=\"https:\/\/doi.org\/10.48550\/arXiv.2403.17954\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">Sort and Slice<\/a> (SNS) was developed by a former 
OPIGlet, Markus, as a method for improving <a href=\"https:\/\/doi.org\/10.1021\/ci100050t\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">Extended Connectivity Fingerprints (ECFPs)<\/a> by overcoming bit collisions. ECFPs are a form of topological fingerprint that denotes the presence or absence of circular substructures in a molecule. The steps for deriving an ECFP from a molecule are as follows:<\/p>\n<ol style=\"margin-top: 0;margin-bottom: 9px\">\n<li><p style=\"margin: 1em 0 9px;margin-bottom: 0\">Identifier assignment:<\/p>\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">Each atom in the molecule is assigned an initial numerical identifier; this is typically generated by hashing a tuple of atomic properties called <a href=\"https:\/\/doi.org\/10.1021\/ci00062a008\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">Daylight atomic invariants<\/a> into a 32-bit integer. These properties are:<\/p>\n<ol style=\"margin-top: 0;margin-bottom: 0\">\n<li>Number of non-hydrogen neighbours.<\/li>\n<li>Valence minus the number of neighbouring hydrogens.<\/li>\n<li>Atomic number.<\/li>\n<li>Atomic mass.<\/li>\n<li>Atomic charge.<\/li>\n<li>Number of hydrogen neighbours.<\/li>\n<li>Ring membership.*<\/li>\n<\/ol>\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">*Ring membership is an additional property that is often used but is not one of the original Daylight atomic invariants.<\/p>\n<\/li>\n\n\n\n<!--more-->\n\n\n\n<li><p style=\"margin: 1em 0 9px;margin-bottom: 0\">Iterative updating:<\/p>\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">Each substructure identifier is updated by hashing the central identifier with all the identifiers of its neighbours, i.e. performing a 1-hop aggregation. 
This process is repeated up to a predetermined maximum radius.<\/p>\n<\/li>\n<li><p style=\"margin: 1em 0 9px;margin-bottom: 0\">Duplicate removal:<\/p>\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">Duplicate substructures are identified and removed.<\/p>\n<\/li>\n<li><p style=\"margin: 1em 0 9px;margin-bottom: 0\">Folding into a smaller bit vector:<\/p>\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">The 32-bit integers are used to produce a sparse vector, with &#8216;on&#8217; bits for each substructure identifier in the molecule. This sparse bit vector is folded into a smaller fixed-length bit vector, typically of 1024 or 2048 bits. This final folding step often produces &#8216;bit collisions&#8217;, with the same bit being activated by different substructures, adding noise to the representation.<\/p>\n<\/li>\n<\/ol>\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">For more background on ECFPs, I recommend <a href=\"https:\/\/doi.org\/10.1021\/ci100050t\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">this paper by Rogers and Hahn (2010)<\/a>, which introduced ECFPs, Leo&#8217;s blopig post on <a href=\"https:\/\/www.blopig.com\/blog\/2022\/06\/exploring-topological-fingerprints-in-rdkit\/\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">topological fingerprints<\/a>, Markus&#8217; blopig post on <a href=\"https:\/\/www.blopig.com\/blog\/2022\/11\/how-to-turn-a-smiles-string-into-an-extended-connectivity-fingerprint-using-rdkit\/\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">turning SMILES into ECFPs with RDKit<\/a>, and Greg Landrum&#8217;s blog post on <a href=\"https:\/\/greglandrum.github.io\/rdkit-blog\/posts\/2023-01-18-fingerprint-generator-tutorial.html\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">fingerprint generators in RDKit<\/a>.<\/p>\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">SNS is a form of dataset-dependent 
feature selection applied to ECFPs. Instead of folding the final sparse vector, SNS retains only the most frequent substructures in a previously observed dataset. Let&#8217;s see how we might do this in Python.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[1]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"c1\" style=\"color: #408080;font-style: italic\"># imports<\/span>\n<span class=\"kn\" style=\"color: #008000;font-weight: bold\">import<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">pandas<\/span> <span class=\"k\" style=\"color: #008000;font-weight: bold\">as<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">pd<\/span>\n<span class=\"kn\" style=\"color: #008000;font-weight: bold\">from<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">rdkit.Chem.rdFingerprintGenerator<\/span> <span class=\"kn\" style=\"color: #008000;font-weight: bold\">import<\/span> <span class=\"n\">AdditionalOutput<\/span><span class=\"p\">,<\/span> <span class=\"n\">GetMorganGenerator<\/span>\n<span class=\"kn\" style=\"color: #008000;font-weight: bold\">import<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">numpy<\/span> <span class=\"k\" style=\"color: #008000;font-weight: bold\">as<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: 
bold\">np<\/span>\n<span class=\"kn\" style=\"color: #008000;font-weight: bold\">from<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">sklearn.model_selection<\/span> <span class=\"kn\" style=\"color: #008000;font-weight: bold\">import<\/span> <span class=\"n\">train_test_split<\/span>\n<span class=\"kn\" style=\"color: #008000;font-weight: bold\">from<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">rdkit<\/span> <span class=\"kn\" style=\"color: #008000;font-weight: bold\">import<\/span> <span class=\"n\">Chem<\/span>\n<span class=\"kn\" style=\"color: #008000;font-weight: bold\">from<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">tqdm<\/span> <span class=\"kn\" style=\"color: #008000;font-weight: bold\">import<\/span> <span class=\"n\">tqdm<\/span> <span class=\"c1\" style=\"color: #408080;font-style: italic\"># progress bar<\/span>\n<span class=\"kn\" style=\"color: #008000;font-weight: bold\">import<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">jdc<\/span> <span class=\"c1\" style=\"color: #408080;font-style: italic\"># allows dynamic class definition across cells<\/span>\n<span class=\"kn\" style=\"color: #008000;font-weight: bold\">import<\/span> <span class=\"nn\" style=\"color: #00F;font-weight: bold\">warnings<\/span>\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h3 id=\"Data\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 18px;margin-bottom: 9px;font-size: 23px'>Data<a class=\"anchor-link\" 
href=\"#Data\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h3>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">First of all we need a dataset to test our code on. <a href=\"https:\/\/doi.org\/10.1038\/sdata.2014.22\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">QM9<\/a> is a commonly used dataset of small molecules with calculated quantum properties, which can be easily downloaded from <a href=\"https:\/\/pubs.rsc.org\/en\/content\/articlelanding\/2018\/sc\/c7sc02664a\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">MoleculeNet<\/a> using pandas:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[2]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"n\">qm9<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"s1\" style=\"color: 
#BA2121\">'https:\/\/deepchemdata.s3-us-west-1.amazonaws.com\/datasets\/qm9.csv'<\/span>\n<span class=\"n\">qm9<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">pd<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">read_csv<\/span><span class=\"p\">(<\/span><span class=\"n\">qm9<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">qm9<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">head<\/span><span class=\"p\">()<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">iloc<\/span><span class=\"p\">[:,:<\/span><span class=\"mi\" style=\"color: #666\">2<\/span><span class=\"p\">]<\/span>\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell-outputWrapper\">\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\">\n<\/div>\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\n<div class=\"jp-OutputArea-child jp-OutputArea-executeResult\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\">Out[2]:<\/div>\n<div class=\"jp-RenderedHTMLCommon jp-RenderedHTML jp-OutputArea-output jp-OutputArea-executeResult\" data-mime-type=\"text\/html\">\n<div>\n<table border=\"1\" class=\"dataframe\" style=\"border-collapse: collapse;border-spacing: 0;background-color: transparent\">\n<thead>\n<tr style=\"text-align: right\">\n<th style=\"padding: 0;text-align: right\"><\/th>\n<th style=\"padding: 0;text-align: left\">mol_id<\/th>\n<th style=\"padding: 0;text-align: left\">smiles<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th style=\"padding: 0;text-align: left;vertical-align: top\">0<\/th>\n<td style=\"padding: 0\">gdb_1<\/td>\n<td style=\"padding: 0\">C<\/td>\n<\/tr>\n<tr>\n<th style=\"padding: 0;text-align: left;vertical-align: top\">1<\/th>\n<td style=\"padding: 0\">gdb_2<\/td>\n<td style=\"padding: 0\">N<\/td>\n<\/tr>\n<tr>\n<th style=\"padding: 0;text-align: left;vertical-align: top\">2<\/th>\n<td style=\"padding: 0\">gdb_3<\/td>\n<td style=\"padding: 
0\">O<\/td>\n<\/tr>\n<tr>\n<th style=\"padding: 0;text-align: left;vertical-align: top\">3<\/th>\n<td style=\"padding: 0\">gdb_4<\/td>\n<td style=\"padding: 0\">C#C<\/td>\n<\/tr>\n<tr>\n<th style=\"padding: 0;text-align: left;vertical-align: top\">4<\/th>\n<td style=\"padding: 0\">gdb_5<\/td>\n<td style=\"padding: 0\">C#N<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">Let&#8217;s perform a train\/test split and convert the SMILES strings into RDKit molecule objects. This will give us train data to generate an SNS featurizer from and test data to run the SNS featurizer over.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[3]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"n\">train<\/span><span class=\"p\">,<\/span> <span class=\"n\">test<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span 
class=\"n\">train_test_split<\/span><span class=\"p\">(<\/span><span class=\"n\">qm9<\/span><span class=\"p\">,<\/span> <span class=\"n\">test_size<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"mf\" style=\"color: #666\">0.4<\/span><span class=\"p\">,<\/span> <span class=\"n\">random_state<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"mi\" style=\"color: #666\">42<\/span><span class=\"p\">)<\/span>\n\n<span class=\"n\">train_smiles<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">train<\/span><span class=\"p\">[<\/span><span class=\"s1\" style=\"color: #BA2121\">'smiles'<\/span><span class=\"p\">]<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">to_list<\/span><span class=\"p\">()<\/span>\n<span class=\"n\">test_smiles<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">test<\/span><span class=\"p\">[<\/span><span class=\"s1\" style=\"color: #BA2121\">'smiles'<\/span><span class=\"p\">]<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">to_list<\/span><span class=\"p\">()<\/span>\n\n<span class=\"n\">train_mols<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">Chem<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">MolFromSmiles<\/span><span class=\"p\">(<\/span><span class=\"n\">smi<\/span><span class=\"p\">)<\/span> <span class=\"k\" style=\"color: #008000;font-weight: bold\">for<\/span> <span class=\"n\">smi<\/span> <span class=\"ow\" style=\"color: #A2F;font-weight: bold\">in<\/span> <span class=\"n\">train_smiles<\/span><span class=\"p\">]<\/span>\n<span class=\"n\">test_mols<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">Chem<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">MolFromSmiles<\/span><span class=\"p\">(<\/span><span class=\"n\">smi<\/span><span 
class=\"p\">)<\/span> <span class=\"k\" style=\"color: #008000;font-weight: bold\">for<\/span> <span class=\"n\">smi<\/span> <span class=\"ow\" style=\"color: #A2F;font-weight: bold\">in<\/span> <span class=\"n\">test_smiles<\/span><span class=\"p\">]<\/span>\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h3 id=\"Coding-Sort-and-Slice\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 18px;margin-bottom: 9px;font-size: 23px'>Coding Sort and Slice<a class=\"anchor-link\" href=\"#Coding-Sort-and-Slice\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h3>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">Now we have our data, how can we build a <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">SortAndSlice<\/code> class? 
There are several components our class needs to possess:<\/p>\n<ol style=\"margin-top: 0;margin-bottom: 9px\">\n<li>An RDKit Morgan Fingerprint generator for performing substructure identifier generation.<\/li>\n<li>An RDKit AdditionalOutput object for obtaining substructure identifiers from the Morgan generator.<\/li>\n<li>A method for obtaining all the substructure identifiers in the training dataset.<\/li>\n<li>A dictionary for holding all the identifiers in the training dataset.<\/li>\n<li>An encoding dictionary of a fixed length for mapping substructures to bits.<\/li>\n<li>A method for converting new molecules to SNS bit vectors.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h4 id=\"Class-Constructor\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 9px;margin-bottom: 9px;font-size: 17px'>Class Constructor<a class=\"anchor-link\" href=\"#Class-Constructor\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">Below is the constructor method of our <code style=\"font-family: 
monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">SortAndSlice<\/code> class. This method takes a list of training data molecules, a Morgan Generator, the desired bit length, and a verbosity flag as input arguments. The constructor defines a fingerprint generator and AdditionalOutput instance as attributes of our class, before computing the identifiers in the dataset and the encoder.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[4]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"k\" style=\"color: #008000;font-weight: bold\">class<\/span> <span class=\"nc\" style=\"color: #00F;font-weight: bold\">SortAndSlice<\/span><span class=\"p\">:<\/span>\n\n    <span class=\"n\">encoder<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"p\">{}<\/span>\n\n    <span class=\"k\" style=\"color: #008000;font-weight: bold\">def<\/span> <span class=\"fm\">__init__<\/span><span class=\"p\">(<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"p\">,<\/span>\n        <span class=\"n\">data<\/span><span class=\"p\">:<\/span> <span class=\"nb\" style=\"color: #008000\">list<\/span><span class=\"p\">[<\/span><span class=\"n\">Chem<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span 
class=\"n\">Mol<\/span><span class=\"p\">],<\/span>\n        <span class=\"n\">generator<\/span><span class=\"p\">:<\/span> <span class=\"n\">GetMorganGenerator<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">GetMorganGenerator<\/span><span class=\"p\">(<\/span><span class=\"n\">radius<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"mi\" style=\"color: #666\">2<\/span><span class=\"p\">),<\/span>\n        <span class=\"n\">bit_length<\/span><span class=\"p\">:<\/span> <span class=\"nb\" style=\"color: #008000\">int<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"mi\" style=\"color: #666\">2048<\/span><span class=\"p\">,<\/span>\n        <span class=\"n\">verbose<\/span><span class=\"p\">:<\/span> <span class=\"nb\" style=\"color: #008000\">bool<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"kc\" style=\"color: #008000;font-weight: bold\">False<\/span><span class=\"p\">,<\/span>\n    <span class=\"p\">):<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">generator<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">generator<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ao<\/span><span class=\"p\">:<\/span> <span class=\"n\">AdditionalOutput<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">AdditionalOutput<\/span><span class=\"p\">()<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ao<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">AllocateBitInfoMap<\/span><span class=\"p\">()<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span 
class=\"n\">verbose<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">verbose<\/span>\n\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">identifiers<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">get_identifiers<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">)<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">set_encoder<\/span><span class=\"p\">(<\/span><span class=\"n\">bit_length<\/span><span class=\"p\">)<\/span>\n\n    \n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h4 id=\"Sort\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 9px;margin-bottom: 9px;font-size: 17px'>Sort<a class=\"anchor-link\" href=\"#Sort\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" 
data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">With our constructor written, the next requirement is a method which calculates the frequency of substructure identifiers across our dataset and sorts the identifiers by frequency. The method below, <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">get_identifiers<\/code>, does the following:<\/p>\n<ol style=\"margin-top: 0;margin-bottom: 9px\">\n<li>Creates an empty dictionary to tally substructure identifiers.<\/li>\n<li>Iterates over each molecule in the input data. For each iteration:<ol style=\"margin-top: 0;margin-bottom: 0\">\n<li>Calculates the sparse Morgan fingerprint of a molecule.<\/li>\n<li>Obtains all the substructure integer identifiers from the additional output object.<\/li>\n<li>For each integer, adds 1 to the total tally.<\/li>\n<\/ol>\n<\/li>\n<li>Sorts the identifiers by frequency.<\/li>\n<li>Returns the identifiers as a dictionary, with identifiers as keys and tallies as values.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[5]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"o\" style=\"color: #666\">%%<\/span><span class=\"k\" style=\"color: #008000;font-weight: bold\">add_to<\/span> SortAndSlice\ndef 
get_identifiers(self, data: list[Chem.Mol]) -&gt; dict[int, int]:\n    identifiers = {}\n    pbar = tqdm(total=len(data), desc='Collecting identifiers', disable=not self.verbose)\n    for mol in data:\n        self.generator.GetSparseFingerprint(mol, additionalOutput=self.ao)\n        bitmap = self.ao.GetBitInfoMap()\n        for identifier in bitmap:\n            count = identifiers.get(identifier, 0)\n            identifiers[identifier] = count + 1\n        pbar.update(1)\n    pbar.close()\n    identifiers = dict(sorted(identifiers.items(), key=lambda x: x[1], reverse=True))\n    return identifiers\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h4 id=\"Slice\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 9px;margin-bottom: 9px;font-size: 17px'>Slice<a class=\"anchor-link\" href=\"#Slice\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">Having computed the identifiers in our dataset and sorted them by frequency, the slicing operation comes next. 
The method <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">set_encoder<\/code> takes an integer, <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">bit_length<\/code>, as an argument. The <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">SortAndSlice<\/code> encoder is then set by taking only the top <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">bit_length<\/code> substructures by frequency and labelling them with a rank, i.e. the identifier dictionary is sliced and enumerated.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[6]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"o\" style=\"color: #666\">%%<\/span><span class=\"k\" style=\"color: #008000;font-weight: bold\">add_to<\/span> SortAndSlice\ndef set_encoder(self, bit_length: int):\n    if self.verbose:\n        print(f'Setting bit length of encoder to a max of {bit_length}.')\n    encoder = {}\n    for i, k in enumerate(self.identifiers.keys()):\n        if i &gt;= bit_length:\n            break\n        encoder[k] = i\n\n    
self.encoder = encoder\n    if len(encoder) &lt; bit_length:\n        warnings.warn(f'Encoder is only {len(encoder)} bits long.')\n\n    if self.verbose:\n        print(f'Encoder set to {len(encoder)} bits.')\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h4 id=\"Featurization\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 9px;margin-bottom: 9px;font-size: 17px'>Featurization<a class=\"anchor-link\" href=\"#Featurization\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">With an encoder dictionary now set, new molecules can be encoded in terms of the substructures already seen in the training molecules. 
The <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">encode<\/code> method takes an RDKit molecule as an argument and performs the following:<\/p>\n<ol style=\"margin-top: 0;margin-bottom: 9px\">\n<li>Calculates the sparse Morgan fingerprint of the molecule.<\/li>\n<li>Obtains the 32-bit substructure identifiers in the molecule.<\/li>\n<li>Generates an output vector of zeros that&#8217;s the length of the encoder.<\/li>\n<li>Iterates over the identifiers in the molecule. For each identifier:<ol style=\"margin-top: 0;margin-bottom: 0\">\n<li>Checks if the identifier is in the encoder.<\/li>\n<li>If the identifier is encoded, its rank is used as the index of the bit in the output vector that is set to 1.<\/li>\n<\/ol>\n<\/li>\n<li>Returns the SNS feature vector for the molecule.<\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[7]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"o\" style=\"color: #666\">%%<\/span><span class=\"k\" style=\"color: #008000;font-weight: bold\">add_to<\/span> SortAndSlice\ndef encode(self, mol: Chem.Mol) -&gt; np.ndarray:\n    self.generator.GetSparseFingerprint(mol, additionalOutput=self.ao)\n    bitmap = self.ao.GetBitInfoMap()\n    out = np.zeros(len(self.encoder))\n    for identifier in 
bitmap:\n        if identifier in self.encoder:\n            out[self.encoder[identifier]] = 1\n    return out\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h4 id=\"Usage\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 9px;margin-bottom: 9px;font-size: 17px'>Usage<a class=\"anchor-link\" href=\"#Usage\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h4>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">Let&#8217;s see how this <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">SortAndSlice<\/code> class can be used. Below, a Morgan Fingerprint generator is defined. 
Next, this generator and the QM9 training molecules are used as arguments for an instance of the SortAndSlice class:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[8]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"n\">gen<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">GetMorganGenerator<\/span><span class=\"p\">(<\/span><span class=\"n\">radius<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"mi\" style=\"color: #666\">2<\/span><span class=\"p\">,<\/span> <span class=\"n\">includeChirality<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"kc\" style=\"color: #008000;font-weight: bold\">True<\/span><span class=\"p\">)<\/span>\n<span class=\"n\">sns<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">SortAndSlice<\/span><span class=\"p\">(<\/span><span class=\"n\">train_mols<\/span><span class=\"p\">,<\/span> <span class=\"n\">generator<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"n\">gen<\/span><span class=\"p\">,<\/span> <span class=\"n\">bit_length<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"mi\" style=\"color: #666\">2048<\/span><span class=\"p\">,<\/span> <span class=\"n\">verbose<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"kc\" style=\"color: 
#008000;font-weight: bold\">True<\/span><span class=\"p\">)<\/span>\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell-outputWrapper\">\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\">\n<\/div>\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\n<div class=\"jp-OutputArea-child\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"application\/vnd.jupyter.stderr\">\n<pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\">Collecting identifiers: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 80331\/80331 [00:02&lt;00:00, 31337.66it\/s]<\/pre>\n<\/div>\n<\/div>\n<div class=\"jp-OutputArea-child\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\n<pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\">Setting bit length of encoder to a max of 2048.\nEncoder set to 2048 bits.\n<\/pre>\n<\/div>\n<\/div>\n<div class=\"jp-OutputArea-child\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"application\/vnd.jupyter.stderr\">\n<pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\">\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div 
class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">To use this <code style=\"font-family: monospace;padding: 2px 4px;font-size: 90%;background-color: #f9f2f4;border-radius: 2px;color: #000\">SortAndSlice<\/code> instance to featurize a set of molecules, we can do the following:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[9]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"n\">test_sort_and_slice<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"p\">[<\/span><span class=\"n\">sns<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">encode<\/span><span class=\"p\">(<\/span><span class=\"n\">mol<\/span><span class=\"p\">)<\/span> <span class=\"k\" style=\"color: #008000;font-weight: bold\">for<\/span> <span class=\"n\">mol<\/span> <span class=\"ow\" style=\"color: #A2F;font-weight: bold\">in<\/span> <span class=\"n\">test_mols<\/span><span class=\"p\">]<\/span>\n<span class=\"nb\" style=\"color: #008000\">print<\/span><span class=\"p\">(<\/span>\n    <span class=\"sa\">f<\/span><span class=\"s1\" style=\"color: #BA2121\">'Number of feature vectors: <\/span><span 
class=\"si\" style=\"color: #B68;font-weight: bold\">{<\/span><span class=\"nb\" style=\"color: #008000\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">test_sort_and_slice<\/span><span class=\"p\">)<\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">}<\/span><span class=\"s1\" style=\"color: #BA2121\">,<\/span><span class=\"se\" style=\"color: #B62;font-weight: bold\">\\n\\<\/span>\n<span class=\"s1\" style=\"color: #BA2121\">    Length of the 0th feature vector: <\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">{<\/span><span class=\"n\">test_sort_and_slice<\/span><span class=\"p\">[<\/span><span class=\"mi\" style=\"color: #666\">0<\/span><span class=\"p\">]<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">shape<\/span><span class=\"p\">[<\/span><span class=\"mi\" style=\"color: #666\">0<\/span><span class=\"p\">]<\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">}<\/span><span class=\"s1\" style=\"color: #BA2121\">,<\/span><span class=\"se\" style=\"color: #B62;font-weight: bold\">\\n\\<\/span>\n<span class=\"s1\" style=\"color: #BA2121\">    On bits in the 0th feature vector: <\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">{<\/span><span class=\"n\">np<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">where<\/span><span class=\"p\">(<\/span><span class=\"n\">test_sort_and_slice<\/span><span class=\"p\">[<\/span><span class=\"mi\" style=\"color: #666\">0<\/span><span class=\"p\">])[<\/span><span class=\"mi\" style=\"color: #666\">0<\/span><span class=\"p\">]<\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">}<\/span><span class=\"s1\" style=\"color: #BA2121\">'<\/span>\n<span class=\"p\">)<\/span>\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell-outputWrapper\">\n<div class=\"jp-Collapser jp-OutputCollapser jp-Cell-outputCollapser\">\n<\/div>\n<div class=\"jp-OutputArea jp-Cell-outputArea\">\n<div 
class=\"jp-OutputArea-child\">\n<div class=\"jp-OutputPrompt jp-OutputArea-prompt\"><\/div>\n<div class=\"jp-RenderedText jp-OutputArea-output\" data-mime-type=\"text\/plain\">\n<pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\">Number of feature vectors: 53554,\n    Length of the 0th feature vector: 2048,\n    On bits in the 0th feature vector: [   0    1    4    5   10   20   21   22   35   53   69  100  629  901\n 1242]\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">And there you have it! 
A set of topological fingerprints without bit collisions!<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<p style=\"margin: 1em 0 9px;margin-bottom: 0\">For clarity, here&#8217;s how the SortAndSlice class looks without cell breaks:<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div><div class=\"jp-Cell jp-CodeCell jp-Notebook-cell jp-mod-noOutputs\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\">\n<div class=\"jp-InputPrompt jp-InputArea-prompt\">In\u00a0[10]:<\/div>\n<div class=\"jp-CodeMirrorEditor jp-Editor jp-InputArea-editor\" data-type=\"inline\">\n<div class=\"cm-editor cm-s-jupyter\">\n<div class=\"highlight hl-ipython3\" style=\"background: #f8f8f8\"><pre style=\"overflow: auto;font-family: monospace;padding: 8.5px;margin: 0 0 9px;color: #333;background-color: #f5f5f5;border: 1px solid #ccc;border-radius: 2px;font-size: inherit;line-height: inherit\"><span><\/span><span class=\"k\" style=\"color: #008000;font-weight: bold\">class<\/span> <span class=\"nc\" style=\"color: #00F;font-weight: bold\">SortAndSlice<\/span><span class=\"p\">:<\/span>\n<span class=\"w\" style=\"color: #bbb\">    <\/span><span class=\"sd\" style=\"color: #BA2121;font-style: italic\">\"\"\"<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">    Class to sort and slice the output of a Morgan fingerprint generator.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\"><\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">    
Args:<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        data (list[Chem.Mol]): List of RDKit molecules.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        generator (GetMorganGenerator): RDKit Morgan fingerprint generator.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        bit_length (int): Length of the output vector.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        verbose (bool): Whether to print progress.<\/span>\n\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">    Attributes:<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        generator (GetMorganGenerator): RDKit Morgan fingerprint generator.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        ao (AdditionalOutput): RDKit fingerprint generator output.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        verbose (bool): Whether to print progress.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        identifiers (dict[int, int]): Dictionary of identifiers and counts.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        encoder (dict[int, int]): Dictionary of identifiers and indices.<\/span>\n\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">    Methods:<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        get_identifiers: Collects identifiers from data.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        set_encoder: Sets the encoder to a specific length.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        encode: Encodes a molecule into a binary vector.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">    \"\"\"<\/span>\n    <span class=\"n\">encoder<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span 
class=\"p\">{}<\/span>\n\n    <span class=\"k\" style=\"color: #008000;font-weight: bold\">def<\/span> <span class=\"fm\">__init__<\/span><span class=\"p\">(<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"p\">,<\/span>\n        <span class=\"n\">data<\/span><span class=\"p\">:<\/span> <span class=\"nb\" style=\"color: #008000\">list<\/span><span class=\"p\">[<\/span><span class=\"n\">Chem<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">Mol<\/span><span class=\"p\">],<\/span>\n        <span class=\"n\">generator<\/span><span class=\"p\">:<\/span> <span class=\"n\">GetMorganGenerator<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">GetMorganGenerator<\/span><span class=\"p\">(<\/span><span class=\"n\">radius<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"mi\" style=\"color: #666\">2<\/span><span class=\"p\">),<\/span>\n        <span class=\"n\">bit_length<\/span><span class=\"p\">:<\/span> <span class=\"nb\" style=\"color: #008000\">int<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"mi\" style=\"color: #666\">2048<\/span><span class=\"p\">,<\/span>\n        <span class=\"n\">verbose<\/span><span class=\"p\">:<\/span> <span class=\"nb\" style=\"color: #008000\">bool<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"kc\" style=\"color: #008000;font-weight: bold\">False<\/span><span class=\"p\">,<\/span>\n    <span class=\"p\">):<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">generator<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">generator<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ao<\/span><span class=\"p\">:<\/span> <span class=\"n\">AdditionalOutput<\/span> <span class=\"o\" style=\"color: 
#666\">=<\/span> <span class=\"n\">AdditionalOutput<\/span><span class=\"p\">()<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ao<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">AllocateBitInfoMap<\/span><span class=\"p\">()<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">verbose<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">verbose<\/span>\n\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">identifiers<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">get_identifiers<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">)<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">set_encoder<\/span><span class=\"p\">(<\/span><span class=\"n\">bit_length<\/span><span class=\"p\">)<\/span>\n\n    <span class=\"k\" style=\"color: #008000;font-weight: bold\">def<\/span> <span class=\"nf\" style=\"color: #00F\">get_identifiers<\/span><span class=\"p\">(<\/span><span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"p\">,<\/span> <span class=\"n\">data<\/span><span class=\"p\">:<\/span> <span class=\"nb\" style=\"color: #008000\">list<\/span><span class=\"p\">[<\/span><span class=\"n\">Chem<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">Mol<\/span><span class=\"p\">])<\/span> <span class=\"o\" style=\"color: #666\">-&gt;<\/span> <span class=\"nb\" style=\"color: #008000\">dict<\/span><span class=\"p\">[<\/span><span class=\"nb\" style=\"color: #008000\">int<\/span><span 
class=\"p\">,<\/span> <span class=\"nb\" style=\"color: #008000\">int<\/span><span class=\"p\">]:<\/span>\n<span class=\"w\" style=\"color: #bbb\">        <\/span><span class=\"sd\" style=\"color: #BA2121;font-style: italic\">\"\"\"<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Collects and sorts identifiers from molecule data.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        <\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Args:<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">            data (list[Chem.Mol]): List of RDKit molecules.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">            <\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Returns:<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">            dict[int, int]: Dictionary of identifiers and counts.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        \"\"\"<\/span>\n        <span class=\"n\">identifiers<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"p\">{}<\/span>\n        <span class=\"n\">pbar<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">tqdm<\/span><span class=\"p\">(<\/span><span class=\"n\">total<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"nb\" style=\"color: #008000\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">),<\/span> <span class=\"n\">desc<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"s1\" style=\"color: #BA2121\">'Collecting identifiers'<\/span><span class=\"p\">,<\/span> <span class=\"n\">disable<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"ow\" style=\"color: #A2F;font-weight: bold\">not<\/span> <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: 
#666\">.<\/span><span class=\"n\">verbose<\/span><span class=\"p\">)<\/span>\n        <span class=\"k\" style=\"color: #008000;font-weight: bold\">for<\/span> <span class=\"n\">mol<\/span> <span class=\"ow\" style=\"color: #A2F;font-weight: bold\">in<\/span> <span class=\"n\">data<\/span><span class=\"p\">:<\/span>\n            <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">generator<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">GetSparseFingerprint<\/span><span class=\"p\">(<\/span><span class=\"n\">mol<\/span><span class=\"p\">,<\/span> <span class=\"n\">additionalOutput<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ao<\/span><span class=\"p\">)<\/span>\n            <span class=\"n\">bitmap<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ao<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">GetBitInfoMap<\/span><span class=\"p\">()<\/span>\n            <span class=\"k\" style=\"color: #008000;font-weight: bold\">for<\/span> <span class=\"n\">identifier<\/span> <span class=\"ow\" style=\"color: #A2F;font-weight: bold\">in<\/span> <span class=\"n\">bitmap<\/span><span class=\"p\">:<\/span>\n                <span class=\"n\">count<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">identifiers<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">get<\/span><span class=\"p\">(<\/span><span class=\"n\">identifier<\/span><span class=\"p\">,<\/span> <span class=\"mi\" style=\"color: #666\">0<\/span><span class=\"p\">)<\/span>\n                <span class=\"n\">identifiers<\/span><span class=\"p\">[<\/span><span 
class=\"n\">identifier<\/span><span class=\"p\">]<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">count<\/span> <span class=\"o\" style=\"color: #666\">+<\/span> <span class=\"mi\" style=\"color: #666\">1<\/span>\n            <span class=\"n\">pbar<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">update<\/span><span class=\"p\">(<\/span><span class=\"mi\" style=\"color: #666\">1<\/span><span class=\"p\">)<\/span>\n        <span class=\"n\">pbar<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">close<\/span><span class=\"p\">()<\/span>\n        <span class=\"n\">identifiers<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"nb\" style=\"color: #008000\">dict<\/span><span class=\"p\">(<\/span><span class=\"nb\" style=\"color: #008000\">sorted<\/span><span class=\"p\">(<\/span><span class=\"n\">identifiers<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">items<\/span><span class=\"p\">(),<\/span> <span class=\"n\">key<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"k\" style=\"color: #008000;font-weight: bold\">lambda<\/span> <span class=\"n\">x<\/span><span class=\"p\">:<\/span> <span class=\"n\">x<\/span><span class=\"p\">[<\/span><span class=\"mi\" style=\"color: #666\">1<\/span><span class=\"p\">],<\/span> <span class=\"n\">reverse<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"kc\" style=\"color: #008000;font-weight: bold\">True<\/span><span class=\"p\">))<\/span>\n        <span class=\"k\" style=\"color: #008000;font-weight: bold\">return<\/span> <span class=\"n\">identifiers<\/span>\n    \n    <span class=\"k\" style=\"color: #008000;font-weight: bold\">def<\/span> <span class=\"nf\" style=\"color: #00F\">set_encoder<\/span><span class=\"p\">(<\/span><span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"p\">,<\/span> <span class=\"n\">bit_length<\/span><span class=\"p\">:<\/span> 
<span class=\"nb\" style=\"color: #008000\">int<\/span><span class=\"p\">):<\/span>\n<span class=\"w\" style=\"color: #bbb\">        <\/span><span class=\"sd\" style=\"color: #BA2121;font-style: italic\">\"\"\"<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Slices substructure identifiers to a specific length and enumerates them.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Sets the encoder attribute to the enumerated identifiers.<\/span>\n\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Args:<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">            bit_length (int): Length of the output vector.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        \"\"\"<\/span>\n        <span class=\"k\" style=\"color: #008000;font-weight: bold\">if<\/span> <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">verbose<\/span><span class=\"p\">:<\/span>\n            <span class=\"nb\" style=\"color: #008000\">print<\/span><span class=\"p\">(<\/span><span class=\"sa\">f<\/span><span class=\"s1\" style=\"color: #BA2121\">'Setting bit length of encoder to a max of <\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">{<\/span><span class=\"n\">bit_length<\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">}<\/span><span class=\"s1\" style=\"color: #BA2121\">.'<\/span><span class=\"p\">)<\/span>\n        <span class=\"n\">encoder<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"p\">{}<\/span>\n        <span class=\"k\" style=\"color: #008000;font-weight: bold\">for<\/span> <span class=\"n\">i<\/span><span class=\"p\">,<\/span> <span class=\"n\">k<\/span> <span class=\"ow\" style=\"color: #A2F;font-weight: bold\">in<\/span> <span class=\"nb\" style=\"color: #008000\">enumerate<\/span><span class=\"p\">(<\/span><span class=\"bp\" 
style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">identifiers<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">keys<\/span><span class=\"p\">()):<\/span>\n            <span class=\"k\" style=\"color: #008000;font-weight: bold\">if<\/span> <span class=\"n\">i<\/span> <span class=\"o\" style=\"color: #666\">&gt;=<\/span> <span class=\"n\">bit_length<\/span><span class=\"p\">:<\/span>\n                <span class=\"k\" style=\"color: #008000;font-weight: bold\">break<\/span>\n            <span class=\"n\">encoder<\/span><span class=\"p\">[<\/span><span class=\"n\">k<\/span><span class=\"p\">]<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">i<\/span>\n\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">encoder<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">encoder<\/span>\n        <span class=\"k\" style=\"color: #008000;font-weight: bold\">if<\/span> <span class=\"nb\" style=\"color: #008000\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">encoder<\/span><span class=\"p\">)<\/span> <span class=\"o\" style=\"color: #666\">&lt;<\/span> <span class=\"n\">bit_length<\/span><span class=\"p\">:<\/span>\n            <span class=\"n\">warnings<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">warn<\/span><span class=\"p\">(<\/span><span class=\"sa\">f<\/span><span class=\"s1\" style=\"color: #BA2121\">'Encoder is only <\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">{<\/span><span class=\"nb\" style=\"color: #008000\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">encoder<\/span><span class=\"p\">)<\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">}<\/span><span class=\"s1\" style=\"color: #BA2121\"> bits long.'<\/span><span class=\"p\">)<\/span>\n\n        <span class=\"k\" 
style=\"color: #008000;font-weight: bold\">if<\/span> <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">verbose<\/span><span class=\"p\">:<\/span>\n            <span class=\"nb\" style=\"color: #008000\">print<\/span><span class=\"p\">(<\/span><span class=\"sa\">f<\/span><span class=\"s1\" style=\"color: #BA2121\">'Encoder set to <\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">{<\/span><span class=\"nb\" style=\"color: #008000\">len<\/span><span class=\"p\">(<\/span><span class=\"n\">encoder<\/span><span class=\"p\">)<\/span><span class=\"si\" style=\"color: #B68;font-weight: bold\">}<\/span><span class=\"s1\" style=\"color: #BA2121\"> bits.'<\/span><span class=\"p\">)<\/span>\n\n    <span class=\"k\" style=\"color: #008000;font-weight: bold\">def<\/span> <span class=\"nf\" style=\"color: #00F\">encode<\/span><span class=\"p\">(<\/span><span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"p\">,<\/span> <span class=\"n\">mol<\/span><span class=\"p\">:<\/span> <span class=\"n\">Chem<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">Mol<\/span><span class=\"p\">)<\/span> <span class=\"o\" style=\"color: #666\">-&gt;<\/span> <span class=\"n\">np<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ndarray<\/span><span class=\"p\">:<\/span>\n<span class=\"w\" style=\"color: #bbb\">        <\/span><span class=\"sd\" style=\"color: #BA2121;font-style: italic\">\"\"\"<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Encodes a molecule into a binary sort and slice vector.<\/span>\n\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Args:<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">            mol (Chem.Mol): RDKit molecule.<\/span>\n\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        Returns:<\/span>\n<span class=\"sd\" 
style=\"color: #BA2121;font-style: italic\">            np.ndarray: Binary vector indicating substructure presence.<\/span>\n<span class=\"sd\" style=\"color: #BA2121;font-style: italic\">        \"\"\"<\/span>\n        <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">generator<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">GetSparseFingerprint<\/span><span class=\"p\">(<\/span><span class=\"n\">mol<\/span><span class=\"p\">,<\/span> <span class=\"n\">additionalOutput<\/span><span class=\"o\" style=\"color: #666\">=<\/span><span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ao<\/span><span class=\"p\">)<\/span>\n        <span class=\"n\">bitmap<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">ao<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">GetBitInfoMap<\/span><span class=\"p\">()<\/span>\n        <span class=\"n\">out<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"n\">np<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">zeros<\/span><span class=\"p\">(<\/span><span class=\"nb\" style=\"color: #008000\">len<\/span><span class=\"p\">(<\/span><span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">encoder<\/span><span class=\"p\">))<\/span>\n        <span class=\"k\" style=\"color: #008000;font-weight: bold\">for<\/span> <span class=\"n\">identifier<\/span> <span class=\"ow\" style=\"color: #A2F;font-weight: bold\">in<\/span> <span class=\"n\">bitmap<\/span><span class=\"p\">:<\/span>\n            <span class=\"k\" style=\"color: #008000;font-weight: bold\">if<\/span> <span class=\"n\">identifier<\/span> <span class=\"ow\" 
style=\"color: #A2F;font-weight: bold\">in<\/span> <span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">encoder<\/span><span class=\"p\">:<\/span>\n                <span class=\"n\">out<\/span><span class=\"p\">[<\/span><span class=\"bp\" style=\"color: #008000\">self<\/span><span class=\"o\" style=\"color: #666\">.<\/span><span class=\"n\">encoder<\/span><span class=\"p\">[<\/span><span class=\"n\">identifier<\/span><span class=\"p\">]]<\/span> <span class=\"o\" style=\"color: #666\">=<\/span> <span class=\"mi\" style=\"color: #666\">1<\/span>\n        <span class=\"k\" style=\"color: #008000;font-weight: bold\">return<\/span> <span class=\"n\">out<\/span>\n<\/pre><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"jp-Cell jp-MarkdownCell jp-Notebook-cell\">\n<div class=\"jp-Cell-inputWrapper\">\n<div class=\"jp-Collapser jp-InputCollapser jp-Cell-inputCollapser\">\n<\/div>\n<div class=\"jp-InputArea jp-Cell-inputArea\"><div class=\"jp-InputPrompt jp-InputArea-prompt\">\n<\/div><div class=\"jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput\" data-mime-type=\"text\/markdown\">\n<h3 id=\"References:\" style='font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif;font-weight: 500;line-height: 1.1;color: inherit;margin-top: 18px;margin-bottom: 9px;font-size: 23px'>References:<a class=\"anchor-link\" href=\"\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">\u00b6<\/a><\/h3><ol style=\"margin-top: 0;margin-bottom: 9px\">\n<li>Dablander, M. et al. (2024) Sort and Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints. <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2403.17954\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">https:\/\/doi.org\/10.48550\/arXiv.2403.17954<\/a><\/li>\n<li>Rogers, D. and Hahn, M. (2010) Extended-Connectivity Fingerprints. 
<a href=\"https:\/\/doi.org\/10.1021\/ci100050t\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">https:\/\/doi.org\/10.1021\/ci100050t<\/a><\/li>\n<li>Weininger, D., Weininger, A. and Weininger, J.L. (1989) SMILES. 2. Algorithm for generation of unique SMILES notation. <a href=\"https:\/\/doi.org\/10.1021\/ci00062a008\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">https:\/\/doi.org\/10.1021\/ci00062a008<\/a><\/li>\n<li>Klarner, L. (2022) Exploring Topological Fingerprints. <a href=\"https:\/\/www.blopig.com\/blog\/2022\/06\/exploring-topological-fingerprints-in-rdkit\/\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">https:\/\/www.blopig.com\/blog\/2022\/06\/exploring-topological-fingerprints-in-rdkit\/<\/a><\/li>\n<li>Dablander, M. (2022) How to turn a SMILES string into an extended-connectivity fingerprint using RDKit. <a href=\"https:\/\/www.blopig.com\/blog\/2022\/11\/how-to-turn-a-smiles-string-into-an-extended-connectivity-fingerprint-using-rdkit\/\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">https:\/\/www.blopig.com\/blog\/2022\/11\/how-to-turn-a-smiles-string-into-an-extended-connectivity-fingerprint-using-rdkit\/<\/a><\/li>\n<li>Landrum, G. (2023) FingerprintGenerator tutorial. <a href=\"https:\/\/greglandrum.github.io\/rdkit-blog\/posts\/2023-01-18-fingerprint-generator-tutorial.html\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">https:\/\/greglandrum.github.io\/rdkit-blog\/posts\/2023-01-18-fingerprint-generator-tutorial.html<\/a><\/li>\n<li>Ramakrishnan, R. et al. (2014) Quantum chemistry structures and properties of 134 kilo molecules. <a href=\"https:\/\/doi.org\/10.1038\/sdata.2014.22\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">https:\/\/doi.org\/10.1038\/sdata.2014.22<\/a><\/li>\n<li>Wu, Z. et al. (2018) MoleculeNet: a benchmark for molecular machine learning. 
<a href=\"https:\/\/pubs.rsc.org\/en\/content\/articlelanding\/2018\/sc\/c7sc02664a\" style=\"background-color: transparent;color: #337ab7;text-decoration: none\">https:\/\/pubs.rsc.org\/en\/content\/articlelanding\/2018\/sc\/c7sc02664a<\/a><\/li>\n<\/ol>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/main>\n<p align=\"center\" style=\"font-size: 12px;color: #9d9999\">WordPress conversion from tutorial.ipynb by <A HREF=\"https:\/\/github.com\/bennylp\/nb2wp\">nb2wp<\/A> v0.3.1<\/p>\n\n\n\n<p>GitHub repo: <a href=\"https:\/\/github.com\/smkyrle\/sort-and-slice-tutorial\">https:\/\/github.com\/smkyrle\/sort-and-slice-tutorial<\/a><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Background\u00b6 Sort and Slice (SNS) was developed by a former OPIGlet, Markus, as a method for improving Extended Connectivity Fingerprints (ECFPs) by overcoming bit collisions. ECFPs are a form of topological fingerprint which denote the absence and presence of circular substructures in a molecule. 
The steps for deriving an ECFP from a molecule are as [&hellip;]<\/p>\n","protected":false},"author":118,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[187,227,201],"tags":[711,680,152,134,786],"ppma_author":[760],"class_list":["post-11543","post","type-post","status-publish","format-standard","hentry","category-cheminformatics","category-python-code","category-small-molecules","tag-ecfps","tag-extended-connectivity-fingerprints","tag-python","tag-small-molecules","tag-sort-and-slice"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":760,"user_id":118,"is_guest":0,"slug":"sam","display_name":"Sam 
Money-Kyrle","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/784870e2ed5304f12f11366dad56cbf1c0b9aa63bd80021ae235ba5f30536a12?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11543","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/118"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=11543"}],"version-history":[{"count":6,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11543\/revisions"}],"predecessor-version":[{"id":12102,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/11543\/revisions\/12102"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=11543"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=11543"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=11543"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=11543"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}