I have run ChemIQ (our chemical reasoning benchmark) on GPT-5. The model achieves state-of-the-art performance with substantial improvements in the ability to interpret SMILES strings. Read my analysis and initial findings below. Scroll to the end for some cool demos.

Figure 1: Success rates for each model on the ChemIQ reasoning benchmark. Horizontal brackets between adjacent bars indicate the result of a two-tailed McNemar’s test comparing paired outcomes for the same questions. Significance levels are shown as: n.s. (not significant, p ≥ 0.05), * (p < 0.05), ** (p < 0.01), and *** (p < 0.001).
ChemIQ is a chemical reasoning benchmark
In May we released our large language model (LLM) chemical reasoning benchmark, which assesses whether LLMs understand the structure of molecules. The benchmark includes a range of questions, starting with easy counting tasks and progressing to very difficult tasks such as structure elucidation from 2D NMR data. In July we released an updated preprint in which we benchmarked additional reasoning models and improved our NMR questions. You can read the paper on arXiv here: Assessing the Chemical Intelligence of Large Language Models (https://arxiv.org/abs/2505.07735).
GPT-5 reaches SOTA performance on ChemIQ
I ran our benchmark within 2 minutes of gaining access to GPT-5. The headline result is that GPT-5 achieves state-of-the-art (SOTA) performance on ChemIQ, scoring 70.2% and exceeding the previous best models by ~14%. It is important to highlight that our benchmark is designed to test the base LLM without tool use – in each case the LLM has to work through the question step-by-step without the aid of tools or code interpreters.
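For readers unfamiliar with the paired comparison used in Figure 1, here is a minimal sketch of a two-tailed McNemar's test on per-question outcomes. It uses the exact (binomial) form of the test. Whether the figure was produced with the exact or the chi-squared variant is not stated here, so treat that choice, and the function name `mcnemar_exact`, as assumptions for illustration only.

```python
from math import comb


def mcnemar_exact(results_a, results_b):
    """Two-tailed exact McNemar test on paired correct/incorrect outcomes.

    results_a and results_b are equal-length sequences of booleans, one entry
    per benchmark question, for two models answering the same questions.
    """
    if len(results_a) != len(results_b):
        raise ValueError("both models must be scored on the same questions")
    # Only discordant pairs carry information about a difference.
    a_only = sum(1 for x, y in zip(results_a, results_b) if x and not y)
    b_only = sum(1 for x, y in zip(results_a, results_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models agree on every question
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, k) for k in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The test ignores questions that both models got right or both got wrong, which is why it suits a benchmark where every model sees the identical question set.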
GPT-5 is not stupid

Figure 2: Radar plot showing model performance by question subcategory. Additional model results can be found in the ChemIQ preprint (https://arxiv.org/pdf/2505.07735).
When presenting my research, I always use the line “large language models are extremely smart, but also really stupid”. I say this because LLMs struggle with apparently trivial tasks, such as counting how many letters are in a word (or, in chemistry, counting the carbons in a molecule), yet they can also do really impressive tasks like NMR elucidation. I need to update my research presentations now though: in our benchmark, GPT-5 was essentially perfect at the carbon counting, ring counting, shortest path, and Free-Wilson analysis tasks. It appears GPT-5 rarely makes stupid mistakes.
Of greatest surprise to me is that GPT-5 achieved 99% on the shortest path tasks. These questions are genuinely difficult: the model must interpret the graph structure of the molecule from the SMILES string and then run a path-finding algorithm to find the shortest path between two points. Previously, o3-mini struggled to answer these questions when presented with a randomized SMILES representation (for this task, canonical SMILES is substantially easier to solve). In these results, GPT-5 made only a single mistake across all 108 shortest path questions.
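To give a feel for what the model has to do in its head, here is a rough sketch of the two steps: turning a SMILES string into a graph, then running a breadth-first search. The parser is a toy that only handles unbracketed atoms, branches, bond symbols and single-digit ring closures, and the function names are made up for this post. Counting the path length in bonds is also my assumption for the example. For real molecules you would use a toolkit such as RDKit.

```python
from collections import deque


def smiles_to_graph(smiles):
    """Parse a restricted SMILES subset into (symbols, adjacency list)."""
    symbols, adjacency = [], []
    prev = None          # atom that the next atom will bond to
    branch_stack = []    # atoms to return to when a branch closes
    ring_open = {}       # ring-closure digit -> atom that opened it
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch in "-=#:/\\":
            pass  # bond order and stereo do not change connectivity
        elif ch.isdigit():
            if ch in ring_open:
                other = ring_open.pop(ch)
                adjacency[prev].append(other)
                adjacency[other].append(prev)
            else:
                ring_open[ch] = prev
        elif ch.isalpha():
            sym = smiles[i:i + 2] if smiles[i:i + 2] in ("Cl", "Br") else ch
            i += len(sym) - 1
            symbols.append(sym)
            adjacency.append([])
            idx = len(symbols) - 1
            if prev is not None:
                adjacency[prev].append(idx)
                adjacency[idx].append(prev)
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += 1
    return symbols, adjacency


def shortest_path_bonds(adjacency, start, end):
    """Breadth-first search; returns the number of bonds between two atoms."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if atom == end:
            return dist[atom]
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return None  # atoms are not connected


# Benzoic acid written two ways: same molecule, different atom order.
_, graph_a = smiles_to_graph("c1ccccc1C(=O)O")
_, graph_b = smiles_to_graph("OC(=O)c1ccccc1")
# Meta ring carbon to the hydroxyl oxygen: 4 bonds in both encodings.
print(shortest_path_bonds(graph_a, 1, 8), shortest_path_bonds(graph_b, 5, 0))
```

The two strings describe the same graph, but the atoms appear in a different order and the ring closure sits in a different place, which hints at why a randomized SMILES can be harder to reason over than a canonical one.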
Gemini 2.5 Pro is still best at NMR elucidation
My favorite subcategory in our benchmark is the 2D NMR elucidation questions. As a chemist, I find the concept of an LLM being able to solve the structure of a molecule from NMR data mind-blowing (and I think most chemists feel the same way). The results of our test show that GPT-5 has not made a substantial gain in its NMR elucidation capabilities, and Gemini 2.5 Pro is still in the lead. Specifically, on the 2D NMR subset, both o3-mini and GPT-5 scored 6% (3/50), whereas Gemini 2.5 Pro scored 20% (10/50). I would have expected GPT-5 to do much better at these questions; I will try to investigate this further.
Time for ChemIQ-2?
When I started working on reasoning models in January, I anticipated that the performance of LLMs in chemistry would improve rapidly over the course of the year. My supervisors and I expected ChemIQ to be saturated by the end of the year. While GPT-5 only reaches 70% overall success, it has achieved essentially perfect performance on four out of eight question categories. Don’t worry though, we anticipated this would happen. The questions in ChemIQ are algorithmically generated, meaning we can quickly create a new set of even harder questions when needed. I also have a few more novel benchmark questions that we didn’t include in the original paper, which I might include in ChemIQ-2.
Demo 1: Interactive kinetics and titration dashboard
This is kinda crazy. I gave the model the prompt “Create an impressive chemistry demo that showcases your capabilities. I need to embed this in our wordpress blog”. In one go, GPT-5 generated an interactive web app (embedded below). You can enter Reaction Kinetics or Acid–Base Titration parameters and the tool will plot the expected curves. (I didn’t do any prompt optimization, I didn’t try multiple times, this is what came out first try.)
[Embedded app: 🔬 ChemLab Interactive, with Kinetics and Titration tabs (Arrhenius & First-Order Decay)]
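If you are curious what sits behind an “Arrhenius & First-Order Decay” panel like this one, the textbook relations are short enough to sketch. This is not the code GPT-5 generated, just my own minimal version with made-up function names, using k = A·exp(−Ea/RT) and [A](t) = [A]0·exp(−kt).

```python
import math

R = 8.314462618  # molar gas constant in J/(mol*K)


def arrhenius_k(pre_exponential, activation_energy, temperature):
    """Rate constant from the Arrhenius equation (Ea in J/mol, T in K)."""
    return pre_exponential * math.exp(-activation_energy / (R * temperature))


def first_order_conc(initial_conc, k, t):
    """Concentration remaining after time t for first-order decay."""
    return initial_conc * math.exp(-k * t)


def half_life(k):
    """Half-life of a first-order process; independent of concentration."""
    return math.log(2) / k


# Example: a decay curve sampled at a few time points (arbitrary parameters).
k = arrhenius_k(pre_exponential=1.0e13, activation_energy=80_000, temperature=298.15)
curve = [first_order_conc(1.0, k, t) for t in (0.0, half_life(k), 2 * half_life(k))]
```

An interactive dashboard would evaluate functions like these on a grid of time points each time a slider moves and redraw the curve.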
Demo 2: Do something really impressive using RDKit. Create an info graphic using the results.
As in my previous blog post, I have very little creativity and have no idea what to ask these models. I used the above prompt and this is what I got (single prompt, first try). Previously this multi-step analysis would have taken a short conversation and iterative prompting; GPT-5 did this all by itself in one go. You can click on the infographic below to see it in full; I have also uploaded the figures separately so they render better.



Demo 3: Molecular orbital theory app
As my last demo, I tried making an app for visualizing molecular orbitals. Unfortunately, the model never managed to visualize 3D molecular orbitals (which is what I really wanted). After a bit of vibe coding, I arrived at this app for simple molecular orbital theory. There are clear errors in the chemistry, and I could iterate further to improve the app, but as a quick, few-prompt, vibe-coded project, this is awesome. (WordPress didn’t allow rendering of Unicode characters, so GPT-5 has tried to draw its own electrons, which look a bit questionable.)
MO Diagram Builder
What an MO energy diagram shows
Vertical axis = energy. Left/right columns are atomic orbitals (AOs); the centre column contains molecular orbitals (MOs) formed by symmetry-matched combinations of AOs.
Bonding vs antibonding: Bonding MOs are lower in energy than the parent AOs; antibonding MOs are higher. Electrons fill from low to high (Aufbau), one per degenerate pi before pairing (Hund), max two with opposite spins (Pauli).
2p ordering crossover: For B2–N2, strong 2s–2p mixing raises sigma(2p) relative to pi(2p), so pi(2p) lies below sigma(2p). For O2–F2, mixing weakens with Z and the order inverts: sigma(2p) below pi(2p). Oxygen's two unpaired electrons in pi*(2p) make O2 paramagnetic.
Bond order = (bonding e - antibonding e) / 2. Any unpaired electrons imply paramagnetism.
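The filling rules and bond order formula above can be sketched in a few lines. This is my own illustration rather than the app's code: the function name `mo_summary` and the `light` flag (True for the B2–N2 ordering, False for the O2–F2 ordering) are made up for the example, and it only covers valence electrons of second-row homonuclear diatomics.

```python
def mo_summary(valence_electrons, light=True):
    """Fill second-row homonuclear diatomic MOs.

    Returns (bond order, number of unpaired electrons).
    """
    # Each level is (label, degeneracy, is_bonding), listed from low to high energy.
    sigma_2p = ("sigma(2p)", 1, True)
    pi_2p = ("pi(2p)", 2, True)
    # 2s-2p mixing puts pi(2p) below sigma(2p) for the lighter molecules.
    middle = [pi_2p, sigma_2p] if light else [sigma_2p, pi_2p]
    levels = [("sigma(2s)", 1, True), ("sigma*(2s)", 1, False), *middle,
              ("pi*(2p)", 2, False), ("sigma*(2p)", 1, False)]

    remaining = valence_electrons
    bonding = antibonding = unpaired = 0
    for _label, degeneracy, is_bonding in levels:
        # Aufbau and Pauli: fill from the bottom, two electrons per orbital.
        n = min(remaining, 2 * degeneracy)
        remaining -= n
        if is_bonding:
            bonding += n
        else:
            antibonding += n
        # Hund: degenerate orbitals are singly occupied before any pairing.
        unpaired += n if n <= degeneracy else 2 * degeneracy - n
    return (bonding - antibonding) / 2, unpaired


print(mo_summary(12, light=False))  # O2: bond order 2.0, two unpaired electrons
print(mo_summary(10, light=True))   # N2: bond order 3.0, no unpaired electrons
```

With 12 valence electrons and the O2–F2 ordering, the last two electrons land singly in the two pi*(2p) orbitals, which reproduces the paramagnetism of O2 described above.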
The Pauling Principle – Can language models understand chemistry?
I was recently a guest on The Pauling Principle podcast, where Javier and I discussed language models for chemistry. You can listen to our chat wherever you listen to podcasts (e.g. Spotify (https://open.spotify.com/episode/44PC0FDk0EPyPXsbGS6ulK?si=43f68c77dc254da1)), or on YouTube:
Conclusion
It has been 2 months since I wrote my previous blog post about ChatGPT having access to RDKit (https://www.blopig.com/blog/2025/06/chatgpt-can-now-use-rdkit/). The progress that has been made in such a short time is truly astounding. I can’t possibly capture all the capabilities of GPT-5 in a single blog post; you really must try it out for yourself. As before, I would love for chemists to try asking these models to do crazy tasks like interpreting their data and generating hypotheses. These models are getting very smart very quickly, and I think they will now be helpful in scientific discovery. Please share your findings with me! I’m interested to see what others get these models to do.
– Nicholas

