{"id":13824,"date":"2025-12-10T15:51:16","date_gmt":"2025-12-10T15:51:16","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=13824"},"modified":"2025-12-10T15:51:19","modified_gmt":"2025-12-10T15:51:19","slug":"chemical-languages-in-machine-learning","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/12\/chemical-languages-in-machine-learning\/","title":{"rendered":"Chemical Languages in Machine Learning"},"content":{"rendered":"\n<p>For more than a century, chemists have been trying to squeeze the beautifully messy, quantum-smeared reality of molecules into tidy digital boxes, \u201cformats\u201d such as line notations, connection tables, coordinate files, or even the vaguely hieroglyphic Wiswesser Line Notation. These formats weren\u2019t designed for machine learning; some weren\u2019t even designed for computers. And yet, they\u2019ve become the wedged into the backbones of modern drug discovery, materials design and computational chemistry.<\/p>\n\n\n\n<p>The emergent use of large language models and natural language processing in chemistry posits the immediate question: <strong>What does it mean for a molecule to have a \u201clanguage,\u201d and how should machines speak it?<\/strong><\/p>\n\n\n\n<p>if molecules are akin to words and sentences, what alphabet and grammatical rules should they follow?<\/p>\n\n\n\n<p>What follows is a tour through the evolving world of chemical languages, why we use them, why our old representations keep breaking our shiny new models, and what might replace them.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Computational Representations of Molecules<\/h2>\n\n\n\n<p>The representation of chemical structure constitutes the fundamental epistemological problem of cheminformatics (often <em>chemoinformatics<\/em>): how to translate the physical reality of a molecule, a probabilistic cloud of particles governed by quantum mechanics, into a discrete, digital format amenable to 
computation and, just as relevantly right now (you <em>are<\/em> reading this, <em>aren\u2019t<\/em> you?), to human readers.<\/p>\n\n\n\n<p>For the majority of the 20th century, this translation was primarily archival, designed to facilitate the storage, indexing, and retrieval of static chemical data.<\/p>\n\n\n\n<p>The dominant paradigms from computational chemistry research, such as the connection table (e.g., atom and bond tables) and the Wiswesser Line Notation (WLN, considered the first chemical description language), were optimised for database efficiency and human manual entry, respectively. These methods treat molecules as graph-like objects, consisting of nodes (atoms) and edges (bonds). The nodes and edges of such graphs can then have additional metadata to specify features such as charge (atoms) or bond order (i.e. single vs. double bonds). These graphs are then serialised into 1-dimensional (1D) representations of the 2D graph, to be stored in a computer system.<\/p>\n\n\n\n<p>Other formats &#8211; typically originating from industry &#8211; also exist, such as the .mol and .sdf specifications based on the original MDL Information Systems formats, and Tripos\u2019s .mol2. These formats additionally serialise the 3D information of molecules (e.g. positions in Euclidean space), and still remain the most commonly used, alongside the .pdb specification, now succeeded by the .mmcif specification (also referred to as the PDBx format).<\/p>\n\n\n\n<p>Which to use is a hard question. As Pat Walters states in his Cheminformatics Rules to Live By:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Never use a PDB (.pdb) file for a small molecule; this includes PDBQT.<\/li>\n\n\n\n<li>There is no good reason to use a .mol2 file, ever.<\/li>\n<\/ol>\n\n\n\n<p>So, .mol\/.sdf is the way, then? 
(the .sdf format is actually multiple Molfiles (.mol) with extra metadata (<em>kinda<\/em>)) Well, across the various domains of computational chemistry, many state-of-the-art algorithms still use older formats, either because the software is an ancient workhorse that \u201cisn\u2019t broke\u201d, is too niche for anyone to update, or was made by developers who are tired of learning and implementing molecular formats. So I treat the Rules to Live By as rules of thumb.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Linguistic Turn in Chemical Representation<\/h2>\n\n\n\n<p>However, the advent of computer-aided drug discovery (CADD), machine learning (\u201cAI\u201d) and other computational, data-driven approaches, has meant it is no longer sufficient to merely catalogue a molecule; the representation must now support algorithmic manipulation, interpolation in latent spaces, and <em>de novo<\/em> generation.<\/p>\n\n\n\n<p>The Simplified Molecular Input Line Entry System (SMILES), first introduced in David Weininger&#8217;s 1988 paper and further developed by Daylight Chemical Information Systems, has prevailed as the dominant specification for representing complex molecules in a compact, human-readable encoding suitable for computational applications. It has a <em>natural language-like<\/em> representation, and now has widely adopted, free and open source (FOSS) specifications (e.g. OpenSMILES, 2007) that enable developers and scientists to leverage the format. Though lacking in explicit 3D information, numerous algorithms exist to convert (really, <em>calculate<\/em>) between SMILES and other formats such as .sdf. That is to say, 3D structures generated from SMILES are predictions, not reconstructions of a ground truth.<\/p>\n\n\n\n<p>As is the case for research on protein and DNA sequences, innovation in natural language processing (NLP) leads to applications in chemical languages. 
BERT, a transformer-based language model architecture, was quickly adapted into protein language models (e.g. BERT-protein) and chemistry language models (e.g. ChemBERTa). Though other representations of molecules (and associated machine learning architectures) exist &#8211; and are at times more popular &#8211; such as graphs or all-atom encodings (think 3D coordinates), we are concerned with the language-like representation here.<\/p>\n\n\n\n<p>Why language models, you ask? SMILES, and even its predecessor WLN, look a lot like words in a language. Just as the analogy has been made of amino acids being the letters\/words of proteins (e.g. <em>ARNDCQE\u2026<\/em>), and similarly of nucleotides in a DNA sequence (e.g. <em>ACGT<\/em>), we extend the analogy to atoms and bonds in molecules.<\/p>\n\n\n\n<p>For example, here is the molecule Indomethacin and the corresponding WLN, with a few components highlighted.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/Screenshot-from-2025-12-10-02-15-49.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"304\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/Screenshot-from-2025-12-10-02-15-49.png?resize=625%2C304&#038;ssl=1\" alt=\"\" class=\"wp-image-13826\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/Screenshot-from-2025-12-10-02-15-49.png?w=676&amp;ssl=1 676w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/Screenshot-from-2025-12-10-02-15-49.png?resize=300%2C146&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/Screenshot-from-2025-12-10-02-15-49.png?resize=624%2C304&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p>In SMILES, the molecule would be the (at least to me) much more intuitive sequence below:<\/p>\n\n\n\n<pre 
class=\"wp-block-code\"><code>CC1=C(C2=C(N1C(=O)C3=CC=C(C=C3)Cl)C=CC(=C2)OC)CC(=O)O<\/code><\/pre>\n\n\n\n<p>Some components can easily mapped between the two notations, for example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>T56 DNJ -&gt; N1=CC=C2C1=CC=CC2 (Indole)\nB1VQ -&gt; CCC(=O)O (Propionic acid)<\/code><\/pre>\n\n\n\n<p>But in reality, the SMILES <em>could <\/em>be:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Cc1n(c2ccc(OC)cc2c1CC(O)=O)C(c1ccc(Cl)cc1)=O\nO=C(c1ccc(cc1)Cl)n1c2c(c(CC(=O)O)c1C)cc(cc2)OC\nc1(OC)cc2c(CC(=O)O)c(C)n(c2cc1)C(c1ccc(Cl)cc1)=O\nc1(cc2c(CC(O)=O)c(C)n(c2cc1)C(c1ccc(cc1)Cl)=O)OC\nc1(ccc(C(n2c3c(c(c2C)CC(O)=O)cc(cc3)OC)=O)cc1)Cl\nc12c(c(n(c2ccc(c1)OC)C(c1ccc(cc1)Cl)=O)C)CC(=O)O\nc12n(c(C)c(CC(O)=O)c2cc(OC)cc1)C(c1ccc(Cl)cc1)=O\nc1c(Cl)ccc(c1)C(=O)n1c2ccc(cc2c(CC(O)=O)c1C)OC\nc1cc(Cl)ccc1C(n1c(c(c2cc(ccc21)OC)CC(O)=O)C)=O\nc1cc(ccc1C(=O)n1c2ccc(OC)cc2c(c1C)CC(=O)O)Cl\n...<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Honey, I Canonicalised the Molecules<\/h2>\n\n\n\n<p>From the second example above, it is evident there can be <em>multiple<\/em> valid SMILES for a given molecule. This fact arises that there is no natural ordering to a graph, including molecular graphs, and has numerous implications, particularly in downstream computational tasks. For example, SMILES strings cannot be used as primary keys in databases that aim to store unique molecules, and machine learning algorithms based on SMILES must learn that multiple \u201cvalid\u201d representations of a given molecule exist.<\/p>\n\n\n\n<p>To overcome this <s>problem<\/s> \u201cfeature\u201d, the original creators of SMILES sought about \u201ccanonicalising\u201d the process, so that a molecule could be encoded into a unique SMILES. This algorithm, named CANGEN is a two-step process. CANON; CANonicalisation, GENES; GENerate the unique SMILES.<\/p>\n\n\n\n<p>Unfortunately, this algorithm is not correct (in an algorithmic sense). 
Molecules, particularly those with a lot of symmetry (e.g. N,N-diallylmelamine), have been used to show that the CANGEN algorithm does not always generate unique SMILES.<\/p>\n\n\n\n<p>Several canonicalisation algorithms exist, but there is absolutely no guarantee that the canonical SMILES generated by any two approaches are the same; only ever use the canonical SMILES generated by a single algorithm (including version!). Even the most widely used cheminformatics Python\/C++ library, <a href=\"https:\/\/github.com\/rdkit\/rdkit\/issues\/8775\">the RDKit, maintains a list of known molecules that fail canonicalisation<\/a>.<\/p>\n\n\n\n<p>Now readers, I hear you ask: if SMILES cannot be canonical, what do we use? The short answer is the International Chemical Identifier (InChI), which is unique*. Well, really, the <em>long<\/em> answer, as InChIs contain more characters than SMILES, which hinders human-readability, storage and downstream use in algorithms. It is also common to see hashed InChIs, known as InChIKeys, used to facilitate searching; but these run the risk of any hash, namely hash collisions, which are known to occur in InChIKey datasets.<\/p>\n\n\n\n<p>*Unique as long as all <em>optional<\/em> information such as stereochemistry or tautomeric layers is explicitly provided.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Getting What We Want From a Molecular Language Representation<\/h2>\n\n\n\n<p>Okay, so, it appears we cannot always have a beautiful <em>canon<\/em> SMILES. But is <em>that<\/em> the reason my machine learning models suck? Probably not, or at least not the sole reason. 
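<\/p>\n\n\n\n<p>Conveniently, alternative orderings of the kind listed earlier are easy to produce. A sketch using the RDKit\u2019s doRandom option (again assuming the RDKit is installed):<\/p>\n\n\n\n
```python
# Sketch: "SMILES augmentation" via random atom orderings (RDKit assumed).
from rdkit import Chem

mol = Chem.MolFromSmiles("Cc1n(c2ccc(OC)cc2c1CC(O)=O)C(c1ccc(Cl)cc1)=O")
reference = Chem.MolToSmiles(mol)

# doRandom=True emits a valid SMILES with a random atom ordering per call
randomised = {Chem.MolToSmiles(mol, canonical=False, doRandom=True)
              for _ in range(10)}

# The strings differ, but all parse back to the same canonical molecule
assert all(Chem.MolToSmiles(Chem.MolFromSmiles(s)) == reference
           for s in randomised)
```
\n\n\n\n<p>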
In fact, some studies have shown that using multiple valid SMILES for the same molecule &#8211; \u201cSMILES augmentation\u201d &#8211; <a href=\"https:\/\/www.nature.com\/articles\/s42256-024-00821-x\">or even invalid<\/a> SMILES improves performance.<\/p>\n\n\n\n<p>But maybe by identifying some other issues with SMILES, we can go about determining what it is that we want from a molecular language representation.<\/p>\n\n\n\n<p>Generative models, so hot right now. The strict syntax of SMILES creates a critical failure mode in generative chemistry models. A valid SMILES string requires perfectly balanced parentheses and matched ring numbers. Generative models, which sample tokens probabilistically, often fail to adhere to these rigid grammatical constraints. This is also a huge issue with formats like InChI.<\/p>\n\n\n\n<p>Machine learning approaches using the SMILES format have struggled with, for example:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Stereochemistry<\/li>\n\n\n\n<li>Rare tokens (often a dataset issue)<\/li>\n\n\n\n<li>Molecular diversity collapse (could be a dataset issue), where the models fail to generate diverse molecules, or fail to traverse molecular space.<\/li>\n\n\n\n<li>The inability to encode 3D information directly (though <em>some<\/em> have proposed <em>some<\/em> models \u201clearn\u201d or encode 3D information indirectly).<\/li>\n<\/ol>\n\n\n\n<p>Luckily for us, there have been variants of SMILES, and new chemical language specifications created to address such problems:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">DeepSMILES<\/h3>\n\n\n\n<p>The core philosophy of DeepSMILES was to reform the SMILES syntax to be more \u201cmachine learning-friendly\u201d by converting absolute, long-range dependencies into relative, local ones. 
The goal was to improve the validity of sequences produced by generative models without sacrificing the information content of the original representation.<\/p>\n\n\n\n<p>In standard SMILES, rings are closed by matching labels (e.g., the pair of 1s in C<strong>1<\/strong>&#8230;C<strong>1<\/strong>). DeepSMILES replaces the second label with a single digit indicating the size of the ring traversal (the number of atoms back in the sequence to connect to). For example, Indomethacin, if originally encoded as the SMILES:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CC1=C(C2=C(N1C(=O)C3=CC=C(C=C3)Cl)C=CC(=C2)OC)CC(=O)O<\/code><\/pre>\n\n\n\n<p>becomes the DeepSMILES:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>CC=CC=CN5C=O)C=CC=CC=C6))Cl)))))))C=CC=C6)OC)))))))CC=O)O<\/code><\/pre>\n\n\n\n<p>The characters 5 and 6 instruct the parser to create a bond between the atom they follow and the atom five or six positions prior in the sequence, respectively. DeepSMILES also implements a conceptually similar change in how branches are encoded, evidenced by a single closing bracket \u201c)\u201d, rather than pairs as seen in the original SMILES.<\/p>\n\n\n\n<p>Therefore, language models are no longer required to maintain \u201clong-term memory\u201d (context) to keep track of concepts such as rings, as is the case for base SMILES or InChI, and decoding validity improves as a ring closure is now represented by a single character.<\/p>\n\n\n\n<p>The trade-off for DeepSMILES is that the strings themselves are slightly less readable. SMILES (at least, <em>I<\/em> find) are quite easily read left-to-right, but in the case of DeepSMILES a \u201clook-back\u201d (with counting!) is required to, for example, close rings. DeepSMILES also only addresses <em>syntactic<\/em> validity (parsing errors), but does not address <em>semantic<\/em> validity (think <em>chemical<\/em> <em>validity<\/em>) such as valence. 
Additionally, models that generate DeepSMILES still share many issues with SMILES models, such as difficulty with stereochemistry and molecular diversity collapse, and the specification lacks canonicalisation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">SELFIES, and Extensions<\/h3>\n\n\n\n<p>The central innovation of SELFIES is the guarantee of 100% robustness: every possible permutation of tokens in the SELFIES alphabet corresponds to a valid molecular graph. This contrasts sharply with SMILES, or DeepSMILES, where a random string is almost certainly invalid.<\/p>\n\n\n\n<p>They also look quite different, as a text representation. Once again, Indomethacin:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#091;C]&#091;C]&#091;=C]&#091;Branch2]&#091;Ring2]&#091;C]&#091;C]&#091;=C]&#091;Branch2]&#091;Ring1]&#091;Ring1]&#091;N]&#091;Ring1]&#091;Branch1]&#091;C]&#091;=Branch1]&#091;C]&#091;=O]&#091;C]&#091;=C]&#091;C]&#091;=C]&#091;Branch1]&#091;Branch1]&#091;C]&#091;=C]&#091;Ring1]&#091;=Branch1]&#091;Cl]&#091;C]&#091;=C]&#091;C]&#091;=Branch1]&#091;Ring2]&#091;=C]&#091;Ring1]&#091;S]&#091;O]&#091;C]&#091;C]&#091;C]&#091;=Branch1]&#091;C]&#091;=O]&#091;O]<\/code><\/pre>\n\n\n\n<p>As you can see, <span style=\"text-decoration: underline\">they are huge.<\/span><\/p>\n\n\n\n<p>SELFIES ensures validity by construction, but this sometimes results in chemically unrealistic or unexpected structures.<\/p>\n\n\n\n<p>If a token requests a double bond (e.g., [=C]) but the current atom (e.g., Oxygen) only has one valence slot remaining, the parsing engine automatically downgrades the bond to a single bond or ignores the instruction entirely. 
This ensures that the octet rule and other valence constraints are never violated, but it feels like we are cheating a bit, especially given that physical constraints can still go unmet.<\/p>\n\n\n\n<p>To address the length of SELFIES strings, there is an extension called GroupSELFIES that can create &#8211; as the name implies &#8211; SELFIES containing groups. For example, if we add to the <em>grammar<\/em> of the SELFIES a <strong>fragment<\/strong> that matches 6-membered rings with some flexibility, our new GroupSELFIES would be:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&#091;C]&#091;C]&#091;=C]&#091;Branch]&#091;:3fragment]&#091;Ring1]&#091;O]&#091;C]&#091;pop]&#091;Ring2]&#091;N]&#091;Branch]&#091;C]&#091;=Branch]&#091;=O]&#091;pop]&#091;:0fragment]&#091;Ring2]&#091;Cl]&#091;pop]&#091;pop]&#091;Ring1]&#091;N]&#091;pop]&#091;pop]&#091;C]&#091;C]&#091;=Branch]&#091;=O]&#091;pop]&#091;O]<\/code><\/pre>\n\n\n\n<p>Specifically: a 1,3,5-trisubstituted benzene, where the wildcard substituents must also be atoms of the ring, or as a SMARTS:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>C1=C(*1)C=C(*1)C(*1)=C1*1<\/code><\/pre>\n\n\n\n<p>GroupSELFIES proves itself to be quite useful if you are able to group substructures together, but the groups must either be manually specified (time-consuming, requires expert domain knowledge) or extracted from a dataset (computationally expensive, scaling exponentially), and there is more to be done to understand the ideal grammar and group size for machine learning applications.<\/p>\n\n\n\n<p>However, SELFIES-based approaches (e.g. STONED-SELFIES) have been shown to improve semantic validity, and to suffer less from molecular diversity collapse.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Emerging and Domain Specific Variants<\/h3>\n\n\n\n<p>Briefly, <a href=\"https:\/\/www.nature.com\/articles\/s41467-024-49388-6\">t-SMILES introduced by Wu et al. 
2024<\/a> is a tree-based fragmented SMILES variant, which they report outperforms classical SMILES, DeepSMILES, SELFIES and baseline models on some tasks.<\/p>\n\n\n\n<p>Another fragment approach, <a href=\"https:\/\/github.com\/datamol-io\/safe\">from Noutahi et al. 2023 is SAFE<\/a>, which was <a href=\"https:\/\/arxiv.org\/pdf\/2310.10773\">shown to outperform GroupSELFIES<\/a>, and creates fragment embeddings based on probabilistic fragmentation.<\/p>\n\n\n\n<p>Most recently, fragSMILES from <a href=\"https:\/\/www.nature.com\/articles\/s42004-025-01423-3\">Mastrolorito et al. 2025<\/a> boasts improved performance over t-SMILES, and provides a grammar-driven fragment-based tokenisation of SMILES.<\/p>\n\n\n\n<p><a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/acscentsci.9b00476\">BigSMILES<\/a> and <a href=\"https:\/\/github.com\/kuennethgroup\/psmiles)\">pSMILES<\/a> are extensions designed for macromolecules and polymers, and have <a href=\"https:\/\/pubs.acs.org\/doi\/10.1021\/acs.macromol.5c00604\">shown to outperform at least classical SMILES<\/a> in their relevant domain-specific tasks.<\/p>\n\n\n\n<p>ClearSMILES is a variant, similar in concept but different in implementation, to DeepSMILES from <a href=\"https:\/\/pubs.acs.org\/doi\/pdf\/10.1021\/acs.jcim.4c02261\">Reboul et al. 2025<\/a> that improves performance in generative models.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A Note on Tokenisation<\/h3>\n\n\n\n<p>The following section is brief, as I have deemed it largely outside of the scope for this already quite long blopig article. However, it would be remiss of me to fail to mention that as with any language model, molecular language models critically depend on tokenization schemes. Attention (no pun intended) must be paid when implementing these models. For example, the C in the English spellings of <strong>C<\/strong>arbon and <strong>C<\/strong>hlorine, are fundamentally the same letter, C. 
However, the \u201cC\u201d\u2019s in the Indomethacin SMILES string (once more for reference):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code><strong>C<\/strong>C1=C(C2=C(N1C(=O)C3=CC=C(C=C3)<strong>C<\/strong>l)C=CC(=C2)OC)CC(=O)O<\/code><\/pre>\n\n\n\n<p>are overloaded, despite being fundamentally different atoms, and tokenisation schemes such as byte pair encoding (BPE) may encode them as the same token (i.e. the model \u201csees\u201d them as the same). Similarly, such tokenisation strategies may create token pairings like \u201dC)(\u201d, <a href=\"https:\/\/www.nature.com\/articles\/s41598-024-76440-8\">which may contribute to worse performance and chemical invalidity.<\/a> For more, I would encourage you to read the <a href=\"https:\/\/arxiv.org\/abs\/2409.15370\">preprint by Wadell et al.<\/a>, which systematically compares tokenisers for chemistry language models, and introduces their own, Smirk which I have used with great success.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>Chemical languages are having their Renaissance.<\/p>\n\n\n\n<p>The transition from archival chemical formats (think filing cabinets) to machine-learning-oriented representations marks a genuine \u201clinguistic turn\u201d in cheminformatics. As SMILES, DeepSMILES, SELFIES, and their successors illustrate, the demands of generative modelling and structured learning require representations that are both syntactically robust and semantically grounded. Yet tokenisation, canonicalisation, and 3D information remain persistent challenges.<\/p>\n\n\n\n<p>Rather than searching for a universally optimal encoding, it may be more fruitful to view chemical languages as task-specific tools. 
Different problems, such as generation, property prediction, and exploration of chemical space, may require different grammars, abstractions, or model architectures.<\/p>\n\n\n\n<p>In my opinion, no single representation is perfect, and none will \u201cwin\u201d outright, but I dream of a future where generative models don\u2019t collapse into chemical gibberish, and I have a feeling that, in that reality, the representations will need to continue to evolve with the algorithms that use them.<\/p>\n\n\n\n<p>In many ways, if the 20th century was about writing down molecules, <strong>the start of the 21st century has been about teaching machines to write them back.<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>For more than a century, chemists have been trying to squeeze the beautifully messy, quantum-smeared reality of molecules into tidy digital boxes, \u201cformats\u201d such as line notations, connection tables, coordinate files, or even the vaguely hieroglyphic Wiswesser Line Notation. These formats weren\u2019t designed for machine learning; some weren\u2019t even designed for computers. 
And yet, they\u2019ve [&hellip;]<\/p>\n","protected":false},"author":148,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[633,187,632,189,201],"tags":[],"ppma_author":[904],"class_list":["post-13824","post","type-post","status-publish","format-standard","hentry","category-ai","category-cheminformatics","category-deep-learning","category-machine-learning","category-small-molecules"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":904,"user_id":148,"is_guest":0,"slug":"hasson","display_name":"Alexander Hasson","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/a18b0a615703cb20a58475f18f331eebaef289fadb6f4794f53d0f5a15f464c4?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13824","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/148"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=13824"}],"version-history":[{"count":2,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13824\/revisions"}],"predecessor-version":[{"id":13827,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13824\/revisions\/13827"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=13824"}],"wp:term":[{"taxonomy":"catego
ry","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=13824"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=13824"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=13824"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}