For more than a century, chemists have been trying to squeeze the beautifully messy, quantum-smeared reality of molecules into tidy digital boxes, “formats” such as line notations, connection tables, coordinate files, or even the vaguely hieroglyphic Wiswesser Line Notation. These formats weren’t designed for machine learning; some weren’t even designed for computers. And yet, they’ve become the wedged into the backbones of modern drug discovery, materials design and computational chemistry.
The emergent use of large language models and natural language processing in chemistry posits the immediate question: What does it mean for a molecule to have a “language,” and how should machines speak it?
if molecules are akin to words and sentences, what alphabet and grammatical rules should they follow?
What follows is a tour through the evolving world of chemical languages, why we use them, why our old representations keep breaking our shiny new models, and what might replace them.
Continue reading