For more than a century, chemists have been trying to squeeze the beautifully messy, quantum-smeared reality of molecules into tidy digital boxes: “formats” such as line notations, connection tables, coordinate files, or even the vaguely hieroglyphic Wiswesser Line Notation. These formats weren’t designed for machine learning; some weren’t even designed for computers. And yet, they’ve become wedged into the backbone of modern drug discovery, materials design, and computational chemistry.
The emergent use of large language models and natural language processing in chemistry raises an immediate question: What does it mean for a molecule to have a “language,” and how should machines speak it?
If molecules are akin to words and sentences, what alphabet and grammatical rules should they follow?
What follows is a tour through the evolving world of chemical languages, why we use them, why our old representations keep breaking our shiny new models, and what might replace them.
Computational Representations of Molecules
The representation of chemical structure constitutes the fundamental epistemological problem of cheminformatics (often chemoinformatics): how to translate the physical reality of a molecule, a probabilistic cloud of particles governed by quantum mechanics, into a discrete, digital format amenable to computation, and, as relevant right now (you are reading this, aren’t you?), to human readers.
For the majority of the 20th century, this translation was primarily archival, designed to facilitate the storage, indexing, and retrieval of static chemical data.
The dominant paradigms from computational chemistry research, such as the connection table (e.g., atom and bond tables) and the Wiswesser Line Notation (WLN, considered the first chemical description language), were optimised for database efficiency and human manual entry, respectively. These methods treat molecules as graph-like objects, consisting of nodes (atoms) and edges (bonds). The nodes and edges of such graphs can then have additional metadata to specify features such as charge (atoms) or strength of covalent bonds (i.e. single vs. double bonds). These graphs are then serialised into 1-dimensional (1D) representations of the 2D graph, to be stored in a computer system.
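As a minimal sketch of this graph-plus-metadata idea — with purely illustrative names, not any real file format — a tiny molecular graph and its 1D serialisation might look like:

```python
# Ethanol (CH3-CH2-OH), heavy atoms only: nodes carry element and formal
# charge; edges carry bond order (1 = single, 2 = double).  This is an
# illustrative toy structure, not a real connection-table specification.
atoms = [
    {"element": "C", "charge": 0},  # atom 0
    {"element": "C", "charge": 0},  # atom 1
    {"element": "O", "charge": 0},  # atom 2
]
bonds = [
    (0, 1, 1),  # C-C single bond
    (1, 2, 1),  # C-O single bond
]

def serialise(atoms, bonds):
    """Flatten the 2D graph into a 1D line: one record per atom, then one
    record per bond, mirroring how connection tables are stored on disk."""
    atom_block = ";".join(f"{a['element']}{a['charge']:+d}" for a in atoms)
    bond_block = ";".join(f"{i}-{j}:{order}" for i, j, order in bonds)
    return atom_block + "|" + bond_block

print(serialise(atoms, bonds))  # C+0;C+0;O+0|0-1:1;1-2:1
```

Real formats differ in the details (fixed-width columns, header blocks, hydrogen handling), but the shape — an atom block, a bond block, and a linear layout — is the same idea.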
Other formats – typically originating from industry – also exist, such as the .mol and .sdf specifications from the original MDL Information Systems, and Tripos’s .mol2. These formats additionally serialise the 3D information of molecules (e.g. positions in Euclidean space), and remain among the most commonly used, alongside the .pdb specification, now succeeded by the .mmcif specification (also referred to as the PDBx format).
Which to use is a hard question. As Pat Walters states in his Cheminformatics Rules to Live By:
- Never use a PDB (.pdb) file for a small molecule, this includes PDBQT.
- There is no good reason to use a .mol2 file, ever.
So, .mol/.sdf is the way, then? (The .sdf format is actually multiple Molfiles (.mol) with extra metadata, kinda.) Well, in the various domains of computational chemistry, many state-of-the-art algorithms still use older formats, either because the software is an ancient workhorse that “isn’t broke”, is too niche for anyone to update, or was made by developers who are tired of learning and implementing molecular formats. So I find the Rules to Live By to be rules of thumb.
The Linguistic Turn in Chemical Representation
However, the advent of computer-aided drug discovery (CADD), machine learning (“AI”), and other computational, data-driven approaches has meant it is no longer sufficient to merely catalogue a molecule; the representation must now support algorithmic manipulation, interpolation in latent spaces, and de novo generation.
The Simplified Molecular Input Line Entry System (SMILES), first introduced in David Weininger’s 1988 paper and further developed by Daylight Chemical Information Systems, has prevailed as the dominant specification for representing complex molecules in a compact, human-readable encoding suitable for computational applications. It has a natural-language-like representation, and now has widely adopted, free and open-source (FOSS) specifications (e.g. OpenSMILES, 2007) that enable developers and scientists to leverage the format. Though SMILES lacks explicit 3D information, numerous algorithms exist to convert (really, calculate) between SMILES and other formats such as .sdf. That is to say, 3D structures generated from SMILES are predictions, not reconstructions of a ground truth.
As has been the case for research on protein and DNA sequences, innovation in natural language processing (NLP) leads to applications in chemical languages. BERT, a transformer-based language model architecture, was quickly adapted into protein language models (e.g. BERT-protein) and chemistry language models (e.g. ChemBERTa). Though other representations of molecules (and associated machine learning architectures) exist – and are at times more popular – such as graphs or all-atom encodings (think 3D coordinates), we are concerned with the language-like representations here.
Why language models, you ask? SMILES, and even its predecessor WLN, look a lot like words in a language. Just as the analogy has been made of amino acids as letters/words in proteins (e.g. ARNDCQE…), and of nucleotides in a DNA sequence (e.g. ACGT), we extend the analogy to atoms and bonds in molecules.
For example, here is the molecule Indomethacin, and the corresponding WLN, with a few components highlighted.

In SMILES, the molecule would be the (at least to me) much more intuitive sequence below:
CC1=C(C2=C(N1C(=O)C3=CC=C(C=C3)Cl)C=CC(=C2)OC)CC(=O)O
Some components can be easily mapped between the two notations, for example:
T56 DNJ -> N1=CC=C2C1=CC=CC2 (Indole)
B1VQ -> CCC(=O)O (Propionic acid)
But in reality, the SMILES could be any of:
Cc1n(c2ccc(OC)cc2c1CC(O)=O)C(c1ccc(Cl)cc1)=O
O=C(c1ccc(cc1)Cl)n1c2c(c(CC(=O)O)c1C)cc(cc2)OC
c1(OC)cc2c(CC(=O)O)c(C)n(c2cc1)C(c1ccc(Cl)cc1)=O
c1(cc2c(CC(O)=O)c(C)n(c2cc1)C(c1ccc(cc1)Cl)=O)OC
c1(ccc(C(n2c3c(c(c2C)CC(O)=O)cc(cc3)OC)=O)cc1)Cl
c12c(c(n(c2ccc(c1)OC)C(c1ccc(cc1)Cl)=O)C)CC(=O)O
c12n(c(C)c(CC(O)=O)c2cc(OC)cc1)C(c1ccc(Cl)cc1)=O
c1c(Cl)ccc(c1)C(=O)n1c2ccc(cc2c(CC(O)=O)c1C)OC
c1cc(Cl)ccc1C(n1c(c(c2cc(ccc21)OC)CC(O)=O)C)=O
c1cc(ccc1C(=O)n1c2ccc(OC)cc2c(c1C)CC(=O)O)Cl
...
Honey, I Canonicalised the Molecules
From the second example above, it is evident that there can be multiple valid SMILES for a given molecule. This arises because there is no natural ordering to a graph, including molecular graphs, and it has numerous implications, particularly for downstream computational tasks. For example, SMILES strings cannot be used as primary keys in databases that aim to store unique molecules, and machine learning algorithms based on SMILES must learn that multiple “valid” representations of a given molecule exist.
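The primary-key problem is plain string identity: two of the Indomethacin SMILES from above compare as different strings even though they encode the same molecule.

```python
# Two valid SMILES for Indomethacin, taken verbatim from the examples above.
a = "CC1=C(C2=C(N1C(=O)C3=CC=C(C=C3)Cl)C=CC(=C2)OC)CC(=O)O"
b = "Cc1n(c2ccc(OC)cc2c1CC(O)=O)C(c1ccc(Cl)cc1)=O"

# As raw strings they differ, so naive string comparison (or a database
# UNIQUE constraint on a SMILES column) treats one molecule as two records.
print(a == b)        # False
print(len({a, b}))   # 2, not 1
```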
To overcome this problem “feature”, the original creators of SMILES set about “canonicalising” the encoding process, so that a molecule could always be mapped to a unique SMILES. This algorithm, named CANGEN, is a two-step process: CANON, CANonicalisation of the atom ranking; GENES, GENeration of the unique SMILES.
Unfortunately, this algorithm is not correct (in an algorithmic sense). Molecules with a lot of symmetry (e.g. N,N-diallylmelamine) have been used to show that CANGEN does not always produce a unique SMILES.
Several canonicalisation algorithms exist, but there is absolutely no guarantee that the canonical SMILES generated by any two approaches are the same: only ever compare canonical SMILES generated by a single algorithm (including version!). Even the most widely used cheminformatics Python/C++ library, RDKit, maintains a list of known molecules that fail canonicalisation.
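To build intuition for what canonicalisation is, here is a deliberately brute-force toy: define the canonical form of a graph as the lexicographically smallest serialisation over all atom relabellings. This is not CANGEN or RDKit’s algorithm (both use clever ranking rather than O(n!) enumeration), just the underlying idea.

```python
from itertools import permutations

def canonical_form(elements, bonds):
    """Toy canonicalisation: try every relabelling of the atoms and keep
    the lexicographically smallest serialisation.  O(n!) -- for intuition
    only; real algorithms compute canonical atom ranks directly."""
    n = len(elements)
    best = None
    for perm in permutations(range(n)):
        # perm[i] is the new label of original atom i
        atom_part = ",".join(
            elements[orig] for orig in sorted(range(n), key=lambda i: perm[i])
        )
        bond_part = ",".join(
            sorted(f"{min(perm[i], perm[j])}-{max(perm[i], perm[j])}"
                   for i, j in bonds)
        )
        cand = atom_part + "|" + bond_part
        if best is None or cand < best:
            best = cand
    return best

# The same molecule (water, H-O-H) entered with two different atom orders:
m1 = canonical_form(["O", "H", "H"], [(0, 1), (0, 2)])
m2 = canonical_form(["H", "O", "H"], [(1, 0), (1, 2)])
print(m1 == m2)  # True: both input orderings map to one canonical string
```

Because the minimum is taken over all orderings, any input ordering of the same graph yields the same string — exactly the property the practical (and much faster) algorithms try to achieve.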
Now, readers, I hear you ask: if SMILES cannot be canonical, what do we use? The short answer is the International Chemical Identifier (InChI), which is unique*. The long answer is that InChIs are longer than SMILES, which hinders human readability, storage, and downstream use in algorithms. It is also common to see hashed InChIs, known as InChIKeys, used to facilitate searching, but these run the risk of any hash: collisions, which are known to occur in InChIKey datasets.
*Unique as long as all optional information such as stereochemistry or tautomeric layers is explicitly provided.
Getting What We Want From a Molecular Language Representation
Okay, so, it appears we cannot always have a beautiful canon SMILES. But is that the reason my machine learning models suck? Probably not, or at least not the sole reason. In fact, some studies have shown that using multiple valid SMILES for the same molecule – “SMILES augmentation” – or even invalid SMILES improves performance.
But maybe by identifying some other issues with SMILES, we can go about determining what it is that we want from a molecular language representation.
Generative models, so hot right now. The strict syntax of SMILES creates a critical failure mode in generative chemistry models. A valid SMILES string requires perfectly balanced parentheses and matched ring numbers. Generative models, which sample tokens probabilistically, often fail to adhere to these rigid grammatical constraints. This is also a huge issue with formats like InChI.
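The two constraints named above — balanced parentheses and matched ring labels — can be checked in a few lines of Python. This is a sketch, not a full SMILES parser (it ignores bracket atoms and two-digit %nn ring labels):

```python
import re

def syntactically_ok(smiles):
    """Check two rigid SMILES constraints that generative models often
    violate: balanced branch parentheses and paired ring-closure digits."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # closed a branch that was never opened
                return False
    if depth != 0:               # a branch was opened but never closed
        return False
    # Each single-digit ring label must appear an even number of times
    # (opened once, closed once).
    digits = re.findall(r"\d", smiles)
    return all(digits.count(d) % 2 == 0 for d in set(digits))

print(syntactically_ok("CC1=CC=CC=C1"))  # True: toluene
print(syntactically_ok("CC1=CC=CC=C"))   # False: ring 1 never closed
print(syntactically_ok("CC(=O)O"))       # True: acetic acid
print(syntactically_ok("CC(=O)(O"))      # False: unbalanced parenthesis
```

A model sampling tokens one at a time has no built-in mechanism to guarantee either property, which is why a fraction of its outputs fail checks like this.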
Machine learning approaches using the SMILES format have struggled with, for example:
- Stereochemistry
- Rare tokens (often a dataset issue)
- Molecular diversity collapse (could be a dataset issue), where the models fail to generate diverse molecules, or fail to traverse molecular space.
- The inability to encode 3D information directly (though some have proposed that models “learn” or encode 3D information indirectly).
Luckily for us, there have been variants of SMILES, and new chemical language specifications created to address such problems:
DeepSMILES
The core philosophy of DeepSMILES was to reform the SMILES syntax to be more “machine learning-friendly” by converting absolute, long-range dependencies into relative, local ones. The goal was to improve the validity of sequences produced by generative models without sacrificing the information content of the original representation.
In standard SMILES, rings are closed by matching labels (e.g., the pair of 1s in C1…C1). DeepSMILES replaces the second label with a single digit indicating the size of the ring traversal (the number of atoms back in the sequence to connect to). For example, Indomethacin, if originally encoded as the SMILES:
CC1=C(C2=C(N1C(=O)C3=CC=C(C=C3)Cl)C=CC(=C2)OC)CC(=O)O
becomes the DeepSMILES:
CC=CC=CN5C=O)C=CC=CC=C6))Cl)))))))C=CC=C6)OC)))))))CC=O)O
The characters 5 and 6 instruct the parser to create a bond between the atom they follow and the atom five or six positions prior in the sequence, respectively. DeepSMILES also implements a conceptually similar change in how branches are encoded, evidenced by a single closing bracket “)” rather than the pairs seen in the original SMILES.
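The ring-closure rewrite can be sketched for the simplest case — unbranched rings of single-character atoms. This is a toy for intuition; the real deepsmiles package handles the general grammar (branches, bracket atoms, %nn labels):

```python
def ring_to_deepsmiles(smiles):
    """Replace SMILES ring-closure digit pairs with a DeepSMILES-style
    ring size.  Handles only unbranched rings of single-character atoms;
    bond symbols like '=' pass through untouched."""
    out = []
    atom_count = 0      # atoms emitted so far
    open_rings = {}     # ring label -> atom_count when the ring was opened
    for ch in smiles:
        if ch.isdigit():
            if ch in open_rings:
                # closing label: emit the traversal size instead
                size = atom_count - open_rings.pop(ch) + 1
                out.append(str(size))
            else:
                open_rings[ch] = atom_count  # opening label: emit nothing
        else:
            out.append(ch)
            if ch.isalpha():
                atom_count += 1
    return "".join(out)

print(ring_to_deepsmiles("C1CCCCC1"))  # CCCCCC6: cyclohexane
print(ring_to_deepsmiles("c1ccccc1"))  # cccccc6: benzene
```

Note how the opening label disappears entirely: the model no longer has to remember to emit a matching digit later, only to count back when closing.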
Language models are therefore no longer required to maintain “long-term memory” (context) to keep track of concepts such as rings, as is the case for base SMILES or InChI, and decoding validity improves because a ring closure is now represented by a single character.
The trade-off for DeepSMILES is that the strings themselves are slightly less readable. SMILES (at least, I find) are quite easily read left-to-right, but in the case of DeepSMILES a “look-back” (with counting!) is required to, for example, close rings. DeepSMILES also only addresses syntactic validity (parsing errors), not semantic validity (think chemical validity) such as valence. Additionally, models that generate DeepSMILES still share many issues with SMILES models, such as difficulty with stereochemistry and molecular diversity collapse, and the specification lacks canonicalisation.
SELFIES, and Extensions
The central innovation of SELFIES is the guarantee of 100% robustness: every possible permutation of tokens in the SELFIES alphabet corresponds to a valid molecular graph. This contrasts sharply with SMILES, or DeepSMILES, where a random string is almost certainly invalid.
They also look quite different, as a text representation. Once again, Indomethacin:
[C][C][=C][Branch2][Ring2][C][C][=C][Branch2][Ring1][Ring1][N][Ring1][Branch1][C][=Branch1][C][=O][C][=C][C][=C][Branch1][Branch1][C][=C][Ring1][=Branch1][Cl][C][=C][C][=Branch1][Ring2][=C][Ring1][S][O][C][C][C][=Branch1][C][=O][O]
As you can see, they are huge.
SELFIES ensures validity by construction, but this sometimes results in chemically unrealistic or unexpected structures.
If a token requests a double bond (e.g., [=C]) but the current atom (e.g., Oxygen) only has one valence slot remaining, the parsing engine automatically downgrades the bond to a single bond or ignores the instruction entirely. This ensures that the octet rule and other valence constraints are never violated, but feels like we are cheating a bit, especially given that physical constraints can still be unmet.
To address the length of SELFIES strings, there is an extension called GroupSELFIES that can create – as the name implies – SELFIES containing groups. For example, if we add to the grammar of the SELFIES a fragment that matches 6-membered rings with some flexibility, our new GroupSELFIES would be:
[C][C][=C][Branch][:3fragment][Ring1][O][C][pop][Ring2][N][Branch][C][=Branch][=O][pop][:0fragment][Ring2][Cl][pop][pop][Ring1][N][pop][pop][C][C][=Branch][=O][pop][O]
Specifically: a 1,3,5-trisubstituted benzene, where the wildcard substituents must also be atoms of the ring, or as a SMARTS:
C1=C(*1)C=C(*1)C(*1)=C1*1
GroupSELFIES proves quite useful if you are able to group substructures together, but the groups must either be manually specified (time-consuming, requiring expert domain knowledge) or extracted from a dataset (computationally expensive, scaling poorly), and there is more to be done to understand the ideal grammar and group size for machine learning applications.
However, SELFIES-based approaches (e.g. STONED-SELFIES) have successfully been shown to improve semantic validity, and suffer less from molecular diversity collapse.
Emerging and Domain Specific Variants
Briefly, t-SMILES, introduced by Wu et al. (2024), is a tree-based, fragmented SMILES variant, which they report outperforms classical SMILES, DeepSMILES, SELFIES, and baseline models on some tasks.
Another fragment-based approach, SAFE from Noutahi et al. (2023), was shown to outperform GroupSELFIES and creates fragment embeddings based on probabilistic fragmentation.
Most recently, fragSMILES from Mastrolorito et al. 2025 boasts improved performance over t-SMILES, and provides a grammar-driven fragment-based tokenisation of SMILES.
BigSMILES and pSMILES are extensions designed for macromolecules and polymers, and have been shown to outperform at least classical SMILES in their relevant domain-specific tasks.
ClearSMILES, from Reboul et al. 2025, is a variant similar in concept to DeepSMILES but different in implementation, and improves performance in generative models.
A Note on Tokenisation
The following section is brief, as I have deemed it largely outside the scope of this already quite long blopig article. However, it would be remiss of me to fail to mention that, as with any language model, molecular language models critically depend on their tokenisation schemes. Attention (no pun intended) must be paid when implementing these models. For example, the C in the English spellings of Carbon and Chlorine is fundamentally the same letter, C. However, the “C”s in the Indomethacin SMILES string (once more for reference):
CC1=C(C2=C(N1C(=O)C3=CC=C(C=C3)Cl)C=CC(=C2)OC)CC(=O)O
are overloaded, despite belonging to fundamentally different atoms, and tokenisation schemes such as byte pair encoding (BPE) may encode them as the same token (i.e. the model “sees” them as the same). Similarly, such tokenisation strategies may create token pairings like ”C)(”, which may contribute to worse performance and chemical invalidity. For more, I would encourage you to read the preprint by Wadell et al., which systematically compares tokenisers for chemistry language models and introduces their own, Smirk, which I have used with great success.
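A common mitigation is atom-wise tokenisation with a regular expression, where the alternation is ordered so that two-letter elements like Cl and Br match before single letters. A minimal sketch (the pattern is simplified from the ones used in the chemistry-LM literature):

```python
import re

# Atom-wise SMILES tokeniser: "Cl" and "Br" appear BEFORE "C" and "B" in
# the alternation, so chlorine's "C" is never confused with carbon's.
PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|N|O|S|P|F|I|b|c|n|o|s|p|C|B"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenise(smiles):
    tokens = PATTERN.findall(smiles)
    # round-trip check: every character must belong to exactly one token
    assert "".join(tokens) == smiles, "untokenisable characters present"
    return tokens

print(tokenise("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```

Compare this with a BPE tokeniser trained on raw characters, which has no such chemistry-aware guarantee.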
Conclusion
Chemical languages are having their Renaissance.
The transition from archival chemical formats (think filing cabinets) to machine-learning-oriented representations marks a genuine “linguistic turn” in cheminformatics. As SMILES, DeepSMILES, SELFIES, and their successors illustrate, the demands of generative modelling and structured learning require representations that are both syntactically robust and semantically grounded. Yet tokenisation, canonicalisation, and 3D information remain persistent challenges.
Rather than searching for a universally optimal encoding, it may be more fruitful to view chemical languages as task-specific tools. Different problems, such as generation, property prediction, and exploration of chemical space may require different grammars, abstractions, or model architectures.
In my opinion, no single representation is perfect, and none will “win” outright, but I dream of a future where generative models don’t collapse into chemical gibberish, and I have a feeling in that reality the representations will need to continue to evolve with the algorithms that use them.
In many ways, if the 20th century was about writing down molecules, the start of the 21st century has been about teaching machines to write them back.
