In the past month we have surpassed a critical threshold with the capabilities of agentic coding models. What previously sounded like science fiction has now become reality, and I don’t believe any of us are ready for what is to come. In this blog post I share a summary of the breakthrough I am referring to, I give an insight into how I use agents to accelerate my research, and I make some predictions for the year. With pride, I can say this entire blog post was 100% written by me without any support from ChatGPT (except spell checking and the image below).
My Existential Crisis
On 26th December 2025 Andrej Karpathy shared a reflection about the state of agents and coding. This post made me realise I am totally behind on the agentic revolution. Even though I spend hours each day reading and experimenting with tools, I find it nearly impossible to keep up. I am working at 100 mph, and I am still in the dust compared with the AI labs:
“I’ve never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue. There’s a new programmable layer of abstraction to master (in addition to the usual layers below) involving agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations, and a need to build an all-encompassing mental model for strengths and pitfalls of fundamentally stochastic, fallible, unintelligible and changing entities suddenly intermingled with what used to be good old fashioned engineering. Clearly some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession. Roll up your sleeves to not fall behind.” – Andrej Karpathy, 26th December 2025
METR Benchmark
Around the same time, the METR results for Claude Opus 4.5 were released. For those unfamiliar, METR is an AI evaluation organisation whose time-horizon benchmark measures the length of coding tasks that LLMs are able to complete at a 50% success rate – in other words, how long an LLM (agent) is able to work unassisted. The empirical observation from METR until recently was that the time horizon was doubling every 7 months. With the advent of reasoning models and agentic harnesses, the doubling rate now appears to be ~4.5 months. By all predictions, we will surpass the 12-hour time horizon by the end of the year. The summary is that agents will soon, if not already, be better at coding than any human.
My value as a DPhil student
Here is where the existential crisis hit. The reality is that I don’t believe I have ever implemented a single feature with a time horizon beyond 5 hours. Here I am defining a ‘feature’ as a discrete unit of code – i.e. a visualiser, a parser, a data-analysis script, etc. The complexity of my research comes from implementing multiple different features in combination, and then ensuring the overall system is sound. At this moment in time, agents are not able to automate my research, as the complexity of the overall system I am building is beyond the time horizon of frontier LLMs; however, my research is simply the combination of multiple parts, and it appears that agents are able to implement each individual part.
The only reason I do not currently fear automation is because LLMs still struggle with chemical accuracy; my value comes from understanding fundamental chemistry and being able to evaluate the capabilities of LLMs in genuine frontier chemistry research and drug discovery. However, with great personal conflict, the aim of my research is to automate myself and build a fully autonomous chemistry researcher which is better at chemistry than I am. When I, or anyone else, achieves this goal (and I have no reason to believe otherwise), I will be scuppered.
As a brief plug before I continue, I already wrote a blog post on a similar theme: ‘I Prompt, Therefore I Am: Is Artificial Intelligence the End of Human Thought?‘. I wrote it in October, predicting that very soon all research will be reduced to a prompting exercise. I didn’t publicise that post as I usually do because I felt insecure about my admission; however, it appears I was correct.
In an attempt to stay ahead of the coming agentic revolution, I have decided to commit to agentic coding and learn how to use these tools to accelerate my research. I am two weeks in, and here are my early findings.
A brief introduction to agentic coding
Agentic coding is where you allow an LLM to directly read and edit your codebase. These models have terminal access, so can write and execute commands in order to navigate and evaluate your code. Getting these tools set up is extremely easy, and I encourage all readers to give it a go. Multiple tools exist, with the most popular appearing to be Claude Code and CODEX:
Reading files
Agents can navigate your entire codebase and add relevant information to their context. This means when you want to implement a new feature, the agent can find the correct way to do this.
Creating and editing files
When you ask an agent to implement some code, it can create scripts whilst following software-engineering best practices. The impressive thing to me is its ability to separate concerns and edit multiple files in parallel.
Planning
I will often request a complex multi-step implementation from CODEX. In these cases, I describe the feature to be implemented and CODEX will then give me a plan for how it will implement this. It may ask clarifying questions, and I may revise its implementation plan. I then say ‘let’s go’ and the agent will execute this plan.
Self checking
The real power of these tools comes from automated testing cycles. The human, or the agent, can construct a suite of tests, and the agent will then automatically run the tests during implementation. This allows the agent to check its work and iterate on solutions before ending its turn, which means the code returned by the agent often works flawlessly the first time. Knowing how to rapidly evaluate the quality of the code generated by the agent is the key to accelerated research.
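As a concrete illustration of this loop, here is a minimal sketch in Python. Everything in it is hypothetical: `parse_formula` is an invented helper standing in for whatever feature the agent is asked to build, and the tests are the human-written acceptance gate the agent reruns after every edit until all of them pass.

```python
import re

def parse_formula(formula: str) -> dict:
    """Toy reference implementation: count atoms in formulae like 'H2O'."""
    counts: dict[str, int] = {}
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(n) if n else 1)
    return counts

# Human-written acceptance tests: the agent reruns these after each edit
# and only ends its turn once every assertion passes.
def test_water():
    assert parse_formula("H2O") == {"H": 2, "O": 1}

def test_glucose():
    assert parse_formula("C6H12O6") == {"C": 6, "H": 12, "O": 6}

if __name__ == "__main__":
    test_water()
    test_glucose()
    print("all tests passed")
```

In practice the tests would live in their own file and be run with a test runner such as pytest; the point is that the failure signal, rather than the human, drives the agent’s iteration.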
Refactoring
It is true that coding agents can make spaghetti code (although this appears to be becoming less of an issue). When I have worked on a feature and want to remove legacy code, fix linting issues, etc., I will ask CODEX to refactor. It will then delete the code I don’t want and tidy the repo. This may appear scary at first; however, in the last two weeks I have not had a single instance where CODEX has made an error. (In the nightmare scenario, I can simply revert the changes with git.)
And More
There are substantially more capabilities of coding agents than those listed above including building task-specific agents, skill documents, etc. I am still looking for resources describing how to use these tools, but I think the best method of learning is trial by fire.
Vibe research
Before I continue, it is important to share a note about academic integrity. I take full responsibility for all work I publish. I will only ever publish work that I am certain is of the highest standard. All my research goes through my supervisors Fergus Imrie and Charlotte Deane, both of whom will not let any mistakes slide. At this moment in time I am doing rapid exploration and prototyping, so vibe coding is acceptable. When I get to the publication stage, I will know every single line of code and be willing to defend it in court. With that out of the way… let me tell you how I am vibe coding my research.
GPT 5.2 Pro
OpenAI claims their GPT-5.2 Pro model to be of ‘PhD-level intelligence’. I have been itching to give Pro a go, so I bit the bullet and now have access. My honest evaluation is that for my research area, the quality of this model’s responses is at a strong undergraduate/Master’s level. I find the model acts as an expert collaborator: I have had multiple cases where it has given me insights about my research that I would not have found myself, and that I didn’t gain from the standard GPT-5.2 model. I routinely ask questions which take the model 10+ minutes to answer, and recently I had my first 30-minute response. In each case I would estimate it would have taken me ~3+ days to create an analysis of equivalent depth. This is not to say the model is flawless; it has generated a few crappy responses, but on the whole I would say it is a powerful model.
Task planning and execution
After brainstorming a feature with GPT 5.2 Pro, I ask the model to create a Markdown context file to give to CODEX. This provides all the information required for CODEX to understand the task, along with a step-by-step implementation plan. I recently gave CODEX a feature plan which really pushed it to the limit. It spent ~30 minutes implementing the code, which involved a fundamental redesign of a core data structure. When it finished, the code worked perfectly the first time. I then realised that I didn’t actually need the feature and we could achieve the same thing in a simpler way, so I have now reverted the change. Had I implemented this feature manually it would have taken me days to complete, and I would probably have felt quite frustrated at the end when I realised it wasn’t what I needed. On this occasion CODEX probably saved me 3–5 days’ worth of work and allowed me to rapidly iterate on ideas.
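For readers who have not seen this pattern, a context file of this kind might look something like the following. The feature, headings, and constraints here are all illustrative inventions, not my actual template:

```markdown
# Feature: result caching for the scoring pipeline (illustrative example)

## Goal
Cache per-molecule scores so that repeated runs skip recomputation.

## Constraints
- Do not change the public API of `score_molecules`.
- All existing tests must continue to pass.

## Plan
1. Add a keyed cache in front of the scoring call.
2. Invalidate the cache whenever the input files change.
3. Add tests covering cache hits and invalidation.
```

The agent reads this file at the start of its turn, so the plan, constraints, and definition of done all survive between sessions.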
Vibe plotting
For my project I know the input and the expected output. As a chemist, I am able to evaluate the success of CODEX on tasks by visualising these outputs. I have a system set up where I ask CODEX to write its work to a ‘staging’ markdown file. The model will emit visualisations, plots, tables, etc., and I can immediately see the impact of the code that was implemented. From this, I can say ‘X looks bad, fix Y, also please do Z’, and we then iterate until we arrive at a point that I am satisfied with.
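A minimal sketch of this staging-file pattern, with invented names throughout: the `STAGING.md` filename, the column headings, and the example scores are all illustrative, not my actual setup.

```python
from pathlib import Path

def write_staging_report(rows, path="STAGING.md"):
    """Render (name, score) pairs as a Markdown table the reviewer can skim."""
    lines = ["# Staging report", "", "| molecule | score |", "| --- | --- |"]
    lines += [f"| {name} | {score:.2f} |" for name, score in rows]
    Path(path).write_text("\n".join(lines) + "\n")
    return path

# Each agent run writes its outputs here; the human reviews the rendered
# Markdown and replies with corrections ('X looks bad, fix Y, ...').
write_staging_report([("aspirin", 0.91), ("caffeine", 0.47)])
```

In a real workflow the agent would also embed plot images via Markdown image links; a plain table is enough to show the shape of the review loop.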
Multi-agent workflow
This is where I start to sound totally mad; however, I promise you what I say here is true. I now work with three agents in parallel. I have a screen with ChatGPT 5 Pro open for brainstorming ideas, and I have two separate CODEX terminals open: one for my primary repo, and a second for prototyping new ideas. I then work by managing these three agents in parallel. This means I am able to study the literature and experiment with two separate codebases all at the same time. I am essentially doing the jobs of four people (one manager and three workers). I will be honest: being a manager of agents is harder work than I expected, but I am quickly improving.
Reflection
My experience these last two weeks has been stark, and at some points scary. It is undeniable that agentic coding has substantially accelerated my research. I am still learning how to use these tools, and I echo Andrej Karpathy’s reflection that if I truly knew how to use them properly, my productivity could be 10x what it was previously. I am still navigating this landscape, and we will see in the next few months how much these tools truly accelerate scientific research.
Predictions for the year
Of course, given it is January, I feel obligated to make some predictions for the year. In all honesty, the first two are not really predictions, but more like observations from the end of 2025 and the prediction is that the world will start catching up in 2026:
- By the end of the year no one will be manually writing code. The concept of human fingers typing code will sound as archaic as punch cards.
- Agentic systems will start being treated like research assistants and will be able to automate significant portions of scientific research.
- A Millennium Prize Problem will be solved (hopefully the Riemann Hypothesis). This is wishful thinking and blind optimism; I just think it will be a very exciting moment when it does finally arrive. While a Millennium Problem falling this year is very ambitious, I do believe the next one to be solved will be solved by an AI model (maybe 2027/28?). At the very least, we will see a few more Erdős problems solved this year.
Conclusion
This is an exciting time for AI research, but a potentially very scary time for many of us. Something massive is happening with agentic systems and we are in the process of a total revolution. I have no idea what the limit of these systems will be, but at this current point in time there are no signs of slowing (if anything, we are accelerating). It is my hope that these tools are used for the benefit of humanity, and I do believe we will soon see some great outcomes from these models. However, there are serious conversations that need to be had around the societal implications of these advances.
If you are interested in LLMs in chemistry, my paper was recently published in the Journal of Chemical Information and Modeling: ‘Assessing the Chemical Intelligence of Large Language Models’ https://pubs.acs.org/doi/10.1021/acs.jcim.5c02145
Nicholas Runcie