LLMs | Oxford Protein Informatics Group

For the last few months, I’ve been building an agent around OPIG’s antibody analysis and design tools, and I thought I’d share some practical notes from the process.

An agent is a language model that doesn’t just answer questions but can also decide what to do, call tools, and follow workflows. I’m using Claude in these notes, but most of the ideas apply equally well to other agent frameworks.

Rather than building an agent from scratch, we’re starting with one that already comes with useful capabilities out of the box. For example, Claude Code can search files, edit code, execute commands, and run scripts. Everything below is really about adapting that behaviour to a specific domain and workflow.

How to start?

Start with the `CLAUDE.md` file. It’s a special file Claude reads at the start of every conversation, and it’s where you define the behaviour of the agent (other agents have their own equivalent — for example `AGENTS.md`). In this file, include things like bash commands, code style preferences, and workflow rules. This gives Claude a persistent context that it can’t infer from the codebase alone. Since it’s loaded every session, it sets the baseline for how the agent behaves.

Start simple – especially if it’s your first time. Define clear tools, write lightweight instructions in the markdown (md) file, and create realistic evaluations before adding complexity.

Then run a loop where the agent gathers context, takes actions, and verifies the outputs. Think about how you’ll verify them first: if you can’t tell whether a run was good, you can’t tell whether your changes helped.
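One way to make the verification step concrete is a small table of prompts with known answers that you re-score after every change. The sketch below is a minimal, hypothetical version: `EvalCase`, `score_run`, and the toy `fake_agent` are illustrative names, not part of Claude Code or the OPIG tools, and a real harness would call the actual agent instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One evaluation case: a prompt plus the answer we expect back."""
    prompt: str
    expected: str


def score_run(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the agent's answer matches."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    passed = sum(
        agent(case.prompt).strip().lower() == case.expected.lower()
        for case in cases
    )
    return passed / len(cases)


# Hypothetical stand-in for a real agent call, used only to exercise the scorer.
def fake_agent(prompt: str) -> str:
    return "yes" if "toy-antibody" in prompt else "no"


cases = [
    EvalCase("Is toy-antibody an antibody?", "yes"),
    EvalCase("Is toy-other an antibody?", "no"),
]
accuracy = score_run(fake_agent, cases)
```

Tracking a single number like `accuracy` across edits to the md files is crude, but it is enough to tell whether a change helped or hurt.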

In research, you don’t always know how a project will evolve, so you’ll often end up making many changes along the way. But for projects that are relatively well-defined, I’ve found it’s worth spending some time upfront with pen and paper, specifying what you want the agent to do before writing it all out.

From there, most development becomes an iterative process of improving the md files and adjusting tools when needed.

What is a tool?

A tool gives the agent a capability. It executes an action and returns a result — calling an API, running code, querying a database, and so on.

The key idea is that tools are deterministic: given the same input, they produce the same output. So if I ask, “Can you check whether this is an antibody?”, the agent reaches for the same tool, `execute_anarci_number()`, and gets the same result for the same sequence.

A tool can be an MCP server or simply a Python function. Both work; what matters is that it gives the agent a reliable way to perform a specific action.

For example, I implemented `execute_anarci_number()` as a Python function: a thin wrapper around ANARCI that returns structured JSON with the results and the execution status. All the tools follow the same general structure, which makes them easier for the agent to use consistently.

The signature and docstring are really all the agent needs to decide when to reach for it:

    def execute_anarci_number(sequence: str, chain_name: str = "Chain") -> dict:
        """Identify and number an antibody/TCR sequence using ANARCI.

        Returns chain type, species, numbering, and whether it's a valid antibody.

        Chain types: H=Heavy, K=Kappa light, L=Lambda light,
        A=TCR-alpha, B=TCR-beta
        """

The function itself is simple: it runs ANARCI, parses the numbering, extracts the CDRs, and checks whether the input looks like a real, complete variable domain. Instead of returning a bare error when numbering fails, the tool returns a structured verdict the agent can reason about:

    # numbering failed → the sequence just isn't an antibody (not a tool error)
    return {
        "success": True,
        "chain_name": chain_name,
        "is_antibody": False,
        "is_tcr": False,
        "chain_type": None,
        "species": None,
        "message": (
            "ANARCI could not number this sequence. "
            "It is likely not an antibody or TCR variable domain."
        ),
        "sequence_length": len(sequence),
    }

One thing I found useful is having tools return an explicit verdict, not just output, so the agent knows whether it received an answer, encountered an error, or was given an invalid input.
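A minimal sketch of that three-way verdict, under stated assumptions: the `status` field, the `run_with_verdict` wrapper, and the 20-letter amino-acid check are all hypothetical additions for illustration, not the structure the OPIG tools actually use, and the toy tools stand in for a real ANARCI call.

```python
from typing import Any, Callable

# Assumption: only the 20 standard amino acids count as valid input.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def run_with_verdict(tool: Callable[[str], dict[str, Any]], sequence: str) -> dict[str, Any]:
    """Run a tool and always return a dict with an explicit 'status' field.

    status is 'ok' (the tool answered), 'invalid_input' (the input was
    rejected before the tool ran), or 'error' (the tool itself failed).
    """
    cleaned = sequence.strip().upper()
    if not cleaned or set(cleaned) - VALID_AMINO_ACIDS:
        return {
            "success": False,
            "status": "invalid_input",
            "message": "Sequence is empty or contains non-amino-acid characters.",
        }
    try:
        result = tool(cleaned)
    except Exception as exc:  # report the failure instead of raising into the agent
        return {
            "success": False,
            "status": "error",
            "message": f"{type(exc).__name__}: {exc}",
        }
    # The wrapper's own fields come last so a tool cannot overwrite them.
    return {**result, "success": True, "status": "ok"}


# Toy tools, used only to exercise the three outcomes.
def toy_tool(seq: str) -> dict[str, Any]:
    return {"sequence_length": len(seq)}


def broken_tool(seq: str) -> dict[str, Any]:
    raise RuntimeError("binary not found")


answered = run_with_verdict(toy_tool, "evqlv")
rejected = run_with_verdict(toy_tool, "EVQ1")
failed = run_with_verdict(broken_tool, "EVQLV")
```

With a wrapper like this, a sequence that simply is not an antibody still comes back as an answer, while a crashed binary or a malformed input is clearly labelled as something else.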

A few things that helped:

- Use the agent itself to help write the tools. It’s good at it, especially if you give Claude documentation for any software libraries, APIs, or SDKs you’re wrapping.
- Don’t forget to document the tool in the markdown workflow file so the agent knows it exists and when to use it.
- Open a fresh session and check the agent can actually call the tools correctly before building on top of them.

What is a skill?

Skills extend Claude with procedural knowledge. They teach the agent how to perform a task, not just what tools are available.

I think of tools as capabilities and skills as workflows. Tools let the agent do something; skills tell it how to approach a task. A tool might let Claude number an antibody sequence. A skill tells it how to carry out an antibody analysis workflow: which tools to use, in what order, what outputs to expect, and how to interpret the results.

Without skills, the model has to rediscover that workflow from scratch each time. Skills package it once and make it reusable.

A skill is just a folder containing a SKILL.md file (instructions plus metadata) and optional scripts or reference material. One nice advantage is portability: because a skill is just a folder of markdown and scripts, you can write it once and reuse it across different projects, environments, and even different agent frameworks.

To make it concrete, here’s one of mine: ab-diversity-select. After an optimization run, I’m left with dozens of candidate antibodies and need to select a small, maximally diverse subset where the retained mutations remain structurally safe. Rather than re-explaining that workflow every time, I captured it as a skill:

    ab-diversity-select/
    ├── SKILL.md                # when to use it + the procedure
    ├── structural_pipeline.py
    ├── pipeline.py
    └── config_template.py

The SKILL.md header tells Claude when the skill is relevant:

    name: ab-diversity-select
    description: >-
      Select a structurally-validated, maximally-diverse subset of antibody candidates from a results CSV…

The rest of the file describes the procedure, while the accompanying scripts do the heavy lifting. When Claude encounters a task like “pick 20 diverse antibody candidates,” it can automatically apply my workflow instead of inventing a new selection strategy from scratch.

Practices that worked for me

There’s already a lot of useful information out there, for example:

anthropic.com/engineering

Claude Code best practices

A few things I’d highlight:

Keep the markdown files organized. `CLAUDE.md` is loaded every session, so only put things in it that apply broadly. For domain-specific knowledge or workflows that are only relevant sometimes, use skills instead. There’s no required format for `CLAUDE.md`; just keep it short and human-readable. Mine roughly covers: setup & environment, architecture & code map, and failure handling.

Use subagents to protect the context. Once the basic agent is working, most improvements come from managing context effectively. Subagents run in their own context with their own set of allowed tools. They’re useful for subtasks that require a lot of context, such as summarizing a paper. In practice, though, I mostly used them for tools that generate large outputs, where it becomes difficult for a single agent to process everything cleanly within one context window.

I defined small operator agents that return only compact summaries. The main agent stays focused on planning and interpretation, large tool outputs stay outside its context, and cheaper, faster models handle parsing and batch work.
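To illustrate what a compact summary might look like, here is a hedged sketch of the reduction step an operator agent could hand back to the main agent. The record fields (`id`, `success`, `chain_type`, `score`) and the function name `compact_summary` are assumptions for illustration, and the values are toy data rather than real results.

```python
from collections import Counter
from typing import Any


def compact_summary(
    records: list[dict[str, Any]], score_key: str = "score", top_n: int = 3
) -> dict[str, Any]:
    """Reduce a large list of tool results to the few fields the main agent needs.

    Assumes every successful record carries 'id' and the score field.
    """
    succeeded = [r for r in records if r.get("success")]
    ranked = sorted(succeeded, key=lambda r: r[score_key], reverse=True)
    return {
        "n_total": len(records),
        "n_failed": len(records) - len(succeeded),
        "chain_types": dict(Counter(r.get("chain_type") for r in succeeded)),
        "top": [{"id": r["id"], score_key: r[score_key]} for r in ranked[:top_n]],
    }


# Toy records standing in for a large batch of tool outputs.
records = [
    {"id": "c1", "success": True, "chain_type": "H", "score": 0.2},
    {"id": "c2", "success": True, "chain_type": "K", "score": 0.9},
    {"id": "c3", "success": False},
    {"id": "c4", "success": True, "chain_type": "H", "score": 0.5},
]
summary = compact_summary(records, top_n=2)
```

The full records stay on disk or in the subagent's context; only `summary` needs to reach the main agent.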

Prompts matter a lot. Performance changes significantly depending on the prompt. In my experience, when building longer workflows, improving the prompt often helps more than editing the markdown files.

For example, explicitly defining the expected output format and level of detail can reduce lazy behaviour and make the agent more consistent across runs.

One approach I like is building a skill that interviews the user up front about the information you care about using the built-in `AskUserQuestion` tool, and then generates the prompt from the user’s answers in a structured way.
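The prompt-generation half of that idea can be sketched in a few lines. This is a hypothetical illustration only: it does not call `AskUserQuestion`, it assumes the answers have already been collected into a plain dictionary, and the field names (`goal`, `inputs`, `output_format`, `detail_level`) are invented for the example.

```python
REQUIRED_FIELDS = ("goal", "inputs", "output_format", "detail_level")


def build_prompt(answers: dict[str, str]) -> str:
    """Turn interview answers into a prompt with an explicit output format."""
    missing = [key for key in REQUIRED_FIELDS if not answers.get(key, "").strip()]
    if missing:
        raise ValueError(f"missing answers: {', '.join(missing)}")
    return "\n".join([
        f"Goal: {answers['goal'].strip()}",
        f"Inputs: {answers['inputs'].strip()}",
        f"Output format: {answers['output_format'].strip()}",
        f"Level of detail: {answers['detail_level'].strip()}",
        "Use the available tools for every numbering or selection step; do not guess.",
    ])


# Example answers, as they might come back from the interview step.
answers = {
    "goal": "Pick 20 diverse antibody candidates",
    "inputs": "results.csv from the last optimization run",
    "output_format": "CSV with one row per selected candidate",
    "detail_level": "One-sentence justification per candidate",
}
prompt = build_prompt(answers)
```

Failing loudly on a missing answer is a deliberate choice here: a prompt with a silent gap tends to produce exactly the inconsistent behaviour the interview was meant to prevent.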

Use the agent to explain its own failures. The agent is actually pretty good at explaining where it failed and why. Use it to help debug and improve itself. Ask it what went wrong, have it suggest edits to the markdown files, or ask what it learned during the session. Some of my best improvements came from just asking the agent why a run failed.

A few bio-specific lessons

First, watch the jargon and define your terms. “Diverse” might mean sequence distance, V-gene spread, or structural diversity. Say exactly what you mean, or define it explicitly in your workflow files.
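As an example of pinning the term down, the sketch below commits to one possible definition: “diverse” means maximizing the smallest Hamming distance between selected sequences, chosen greedily. This is an assumption for illustration, not the `ab-diversity-select` pipeline; it requires pre-aligned, equal-length sequences, and the inputs are toy strings rather than real antibodies.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def select_diverse(seqs: list[str], k: int) -> list[str]:
    """Greedy max-min selection: repeatedly add the sequence whose nearest
    already-chosen neighbour is farthest away."""
    if not 0 < k <= len(seqs):
        raise ValueError("k must be between 1 and len(seqs)")
    chosen = [0]  # seed with the first sequence (an arbitrary choice)
    while len(chosen) < k:
        candidates = [i for i in range(len(seqs)) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: min(hamming(seqs[i], seqs[j]) for j in chosen),
        )
        chosen.append(best)
    return [seqs[i] for i in chosen]


toy_seqs = ["AAAA", "AAAT", "TTTT", "AATT"]
picked = select_diverse(toy_seqs, 3)  # ["AAAA", "TTTT", "AATT"]
```

Swapping `hamming` for a V-gene or structural distance would give a different, equally defensible notion of diversity, which is why the definition belongs in the workflow files rather than in the agent's guesswork.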

Second, the agent will always give you an answer, so make sure it is grounded in tools rather than invented. A language model can easily produce a confident, plausible-looking sequence or numbering out of thin air. If you do not explicitly tell the agent to use the available tools, it may continue without them, even when they exist.
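One cheap guard is to check that every sequence the agent reports actually appeared in a tool call or tool result. The sketch below assumes a hypothetical log format, where each entry has an `input` string and an optional `output_sequences` list; real agent logs will look different and would need their own parser. `hypothetical_designer` is a made-up tool name.

```python
from typing import Any


def ungrounded_claims(reported: list[str], tool_log: list[dict[str, Any]]) -> list[str]:
    """Return reported sequences that never appeared in any tool input or output."""
    seen: set[str] = set()
    for entry in tool_log:
        if "input" in entry:
            seen.add(entry["input"])
        seen.update(entry.get("output_sequences", []))
    return [seq for seq in reported if seq not in seen]


# Toy log: one numbering call and one call to a made-up design tool.
tool_log = [
    {"tool": "execute_anarci_number", "input": "AAAA", "output_sequences": []},
    {"tool": "hypothetical_designer", "input": "AAAA", "output_sequences": ["AAAT"]},
]
suspicious = ungrounded_claims(["AAAT", "TTTT"], tool_log)  # ["TTTT"]
```

Anything in `suspicious` is a sequence the agent mentioned without a tool ever producing it, which is worth a manual look before it reaches your results.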

Finally, keep a human in the loop. Read the logs yourself, understand what happened, and do not trust a clean-looking summary on its own. Ask the agent to explain each step and justify its decisions — that is often the fastest way to catch a wrong assumption before it ends up in your results.

Agents are surprisingly capable, but I still found it challenging to get them to reliably execute long workflows without intervention. In practice, I had the most success when treating the agent as a collaborator rather than a fully autonomous system, giving it clear tools, workflows, and checkpoints along the way.

Building agents is still a fast-moving area, and there are many ways to approach it. It can feel confusing at first, but once you start experimenting and building real projects, things become much clearer. My advice would be to start simple, build something useful, and learn by doing.


Over the last few years, large language models (LLMs) have gone from being curiosities tucked away in research labs to something most of us interact with on a daily basis, whether for drafting emails, debugging code, or simply pondering the meaning of life at 2am. And yet, for all our reliance on these systems, a rather inconvenient truth lingers in the background: nobody, not even the people who built them, can fully explain what is going on inside.

This is where mechanistic interpretability comes in.

In essence, mechanistic interpretability is the approach of explaining complex machine learning systems through the behaviour of their functional units (Kästner and Crook, 2024) by reverse-engineering them into their more elementary computations (Rai et al., 2025). The aim is not simply to know that a model gives the right answer, but to pull apart the underlying machinery and uncover the causal relationships between input and output. Think of it as neuroscience for neural networks, except we can read every neuron at any moment, rewind, replay, and intervene mid-thought.



Building an Agent – Practical Notes for Beginners


Peering Inside the Black Box: A Beginner’s Introduction to Mechanistic Interpretability
