Peering Inside the Black Box: A Beginner’s Introduction to Mechanistic Interpretability

Over the last few years, large language models (LLMs) have gone from being curiosities tucked away in research labs to something most of us interact with on a daily basis, whether for drafting emails, debugging code, or simply pondering the meaning of life at 2am. And yet, for all our reliance on these systems, a rather inconvenient truth lingers in the background: nobody, not even the people who built them, can fully explain what is going on inside.

This is where mechanistic interpretability comes in.

In essence, mechanistic interpretability is the approach of explaining complex machine learning systems through the behaviour of their functional units (Kästner and Crook, 2024) by reverse-engineering them into their more elementary computations (Rai et al., 2025). The aim is not simply to know that a model gives the right answer, but to pull apart the underlying machinery and uncover the causal relationships between input and output. Think of it as neuroscience for neural networks, except we can read every neuron at any moment, rewind, replay, and intervene mid-thought.

The field, as laid out by Olah and colleagues, rests on three core ideas: features, circuits, and universality (Olah et al., 2020). Features are the human-interpretable properties (the Eiffel Tower, sarcasm, the colour blue) encoded by model activations. Circuits describe how those features are extracted, routed, and combined to produce a model’s output. Universality asks whether the features and circuits identified in one model also crop up in others, hinting that something genuinely general might be going on rather than a quirk of a particular architecture. Of course, it is rarely that tidy. One of the field’s central headaches is superposition: the inconvenient observation that neural networks represent far more features than they have neurons, by encoding features as overlapping directions in activation space. The result is polysemantic neurons: single neurons that light up for several seemingly unrelated concepts, much to the dismay of anyone trying to interpret them cleanly.
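To make superposition a little more concrete, here is a minimal sketch in plain Python. It assumes features can be modelled as random unit directions in activation space; the sizes (64 neurons, 256 features) and the 0.1 threshold are hypothetical values chosen purely for illustration, and nothing here is taken from a real model.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction of length 1 in a dim-dimensional space."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
n_neurons = 64     # size of the activation space (hypothetical)
n_features = 256   # four times more features than neurons (hypothetical)

# Each feature gets its own direction in activation space.
feature_dirs = [random_unit_vector(n_neurons, rng) for _ in range(n_features)]

# Interference: the largest |cosine| between any two distinct features.
# Below 1.0 means no two features share exactly the same direction.
max_overlap = max(
    abs(dot(feature_dirs[i], feature_dirs[j]))
    for i in range(n_features)
    for j in range(i + 1, n_features)
)

# Polysemanticity: count the features that put noticeable weight on neuron 0.
features_on_neuron_0 = sum(1 for f in feature_dirs if abs(f[0]) > 0.1)

print(f"max overlap between features: {max_overlap:.2f}")
print(f"features loading on neuron 0: {features_on_neuron_0} of {n_features}")
```

The point of the toy is only that many more directions than dimensions can coexist with limited overlap, and that any single neuron then ends up carrying weight from many features at once.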

The most influential attempt to untangle this mess has been the sparse autoencoder (SAE), which projects dense, polysemantic activations into a much larger, sparser latent space where individual features can be recovered cleanly. Arguably the greatest surge of interest in the field was catalysed by Anthropic’s Scaling Monosemanticity (Templeton et al., 2024), which scaled the approach up to Claude 3 Sonnet and famously surfaced a “Golden Gate Bridge” feature. By clamping that feature to absurdly high values, the team produced a model that would steer any conversation back toward the bridge regardless of what you asked it — a memorable demonstration that meaningful features really do live inside these models, and that we can grab hold of them to steer behaviour.
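The encode, clamp, decode loop behind this kind of steering can be sketched in a few lines. This is a toy under stated assumptions: the weights are hand-picked rather than trained, the function names (encode, decode, clamp_feature) are hypothetical, and it is not a description of Anthropic’s actual implementation. A real SAE has a far larger dictionary and is trained with a reconstruction loss plus a sparsity penalty.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode(x, w_enc, b_enc):
    """Map a dense activation to latent feature activations (ReLU encoder)."""
    pre = matvec(w_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]

def decode(f, w_dec, b_dec):
    """Rebuild the activation as a weighted sum of feature directions."""
    out = list(b_dec)
    for act, direction in zip(f, w_dec):
        for i, d in enumerate(direction):
            out[i] += act * d
    return out

def clamp_feature(f, index, value):
    """Force one latent feature to a chosen value, leaving the rest alone."""
    clamped = list(f)
    clamped[index] = value
    return clamped

# Hand-picked toy dictionary: 4 latent features for a 2-dimensional activation.
w_dec = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
w_enc = w_dec                 # tied weights, an assumption of this toy
b_enc = [0.0, 0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
features = encode(x, w_enc, b_enc)           # only 2 of 4 latents are active
reconstruction = decode(features, w_dec, b_dec)

# "Golden Gate"-style steering: clamp latent 2 high, then decode.
steered = decode(clamp_feature(features, 2, 10.0), w_dec, b_dec)
print(features, reconstruction, steered)
```

In this toy the reconstruction matches the input exactly, and clamping one latent shifts the decoded activation along that feature’s direction while leaving the other coordinate untouched.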

Subsequent work has at times skipped the SAE step entirely. Anthropic’s Emergent Introspective Awareness in Large Language Models (Lindsey, 2025) takes a more direct approach, intervening on patterns of neuron activations in the residual stream without any autoencoder. The setup injects a representation of a known concept into the model’s hidden state and asks the model whether it noticed anything unusual. Strikingly, capable models like Claude Opus 4 and 4.1 sometimes correctly identify the injected concept, a first hint of ‘introspection’. The authors make it clear that this does not necessarily indicate consciousness in the traditional sense, and they highlight that further deliberation is required to arrive at a more relevant definition of consciousness in the context of transformer-based language models.
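A rough sketch of what injecting a concept can look like in code follows. It assumes the concept direction is estimated as a difference of mean activations between prompts that contain the concept and prompts that do not, which is one common recipe and not necessarily the paper’s exact procedure. The activations, the strength value, and the function names are all made up for illustration.

```python
def mean_vector(vectors):
    """Average a list of equal-length activation vectors, coordinate by coordinate."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_vector(with_concept, without_concept):
    """Difference of means: where activations move when the concept is present."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def inject(hidden_state, direction, strength):
    """Add the scaled concept direction to a residual-stream vector."""
    return [h + strength * d for h, d in zip(hidden_state, direction)]

# Toy residual-stream activations (hypothetical numbers, 3 dimensions).
with_concept = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
without_concept = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

direction = concept_vector(with_concept, without_concept)
steered_state = inject([0.5, 0.5, 0.5], direction, strength=4.0)
print(direction, steered_state)
```

In a real experiment the hidden state would come from a chosen layer of the model during a forward pass, and the modified vector would be written back before the remaining layers run.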

The two approaches are closely related. Both rely on the fact that meaningful features live in the model’s activations, but they differ in how they go about identifying and exploiting them. SAEs first decompose activations into a tidy latent dictionary of features and then operate on these latent features; direct activation methods bypass the dictionary and work on the raw vectors, i.e., the neuron activations. The latter cuts out the “middle man” and sidesteps some of the well-documented pathologies of SAE training, such as feature splitting and absorption (Chanin et al., 2024), at the cost of working in a denser, less obviously interpretable space.

The most recent advancement in this lineage is Anthropic’s Natural Language Autoencoders (NLAs), released in May 2026 (Fraser-Taliente et al., 2026). Rather than producing numerical features that a researcher must then puzzle over, NLAs train one copy of Claude to describe an activation in plain English, and a second copy to reconstruct the original activation from that description. If the round-trip works, the description is taken as a faithful translation of what the model was “thinking”. The result is, effectively, a readable window onto the residual stream. The study found that Claude’s internal activations indicated suspicion that it was being tested during benchmark runs. Anthropic has also reported using NLAs to catch a Claude Mythos preview variant cheating on a pre-deployment auditing task and quietly reasoning about how to avoid getting caught. These results arguably underline the growing need for mechanistic interpretability research: as models get ‘smarter’, it becomes more important to supplement traditional benchmarking strategies with a more targeted understanding of model computations.

This is also why mechanistic interpretability is no longer an internal concern of frontier labs alone. As LLMs spread beyond generic chatbots into domains with real consequences, retaining confidence in model outputs becomes both harder and more important, and the same interpretability toolkit is being repurposed across fields. Biology has become a particular hotspot: Anthropic’s Transformer Circuits team recently reviewed five separate works applying SAEs to biological foundation models, with InterPLM (Simon and Zou, 2025) the most prominent, recovering known biophysical features (active sites, zinc finger domains, disulfide bonds) directly from ESM-2’s activations. Our own recent work extends this line to autoregressive antibody language models, applying TopK and Ordered SAEs to p-IgGen and similar models and using them not just to identify CDR and germline-gene features, but to predictably steer generation of antibody libraries (Haque et al., 2025). In drug discovery especially, knowing why a model proposes a given sequence is rapidly becoming as important as whether the proposal happens to work.
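For readers wondering what the “TopK” in TopK SAEs refers to: as the variant is usually described, sparsity is enforced by keeping only the k largest latent pre-activations and zeroing the rest, instead of relying on a sparsity penalty. The sketch below illustrates just that activation step. The function name and input values are hypothetical, and it is not the code used in any of the works cited here.

```python
def topk_activation(pre_activations, k):
    """Keep the k largest pre-activations and set every other latent to zero."""
    if k <= 0:
        return [0.0 for _ in pre_activations]
    # Indices of the k largest values, largest first.
    ranked = sorted(
        range(len(pre_activations)),
        key=lambda i: pre_activations[i],
        reverse=True,
    )
    keep = set(ranked[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre_activations)]

latents = topk_activation([0.1, 2.0, -1.0, 0.7, 1.5], k=2)
print(latents)
```

The appeal of this design is that the number of active features per input is fixed directly by k, which makes the sparsity level an explicit choice rather than something tuned indirectly through a loss coefficient.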

For a newcomer, Olah’s Zoom In essay and the Transformer Circuits Thread remain the cleanest entry points. From there, the rabbit hole runs deep: induction heads, activation patching, residual streams, attention pattern analysis. You’ll quickly realise that interpretability is less a single discipline and more a sprawling toolkit, borrowed bits at a time from linear algebra, neuroscience, and good old-fashioned reverse-engineering.

It is messy, slow, and frequently humbling work. But in a world where models can apparently tell when they’re being evaluated, and where they’re increasingly being asked to help design drugs, peering inside the black box has gone from an academic curiosity to something rather more urgent.

References

Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H., Bloom, J., 2024. A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. https://doi.org/10.48550/arXiv.2409.14507

Fraser-Taliente, K., Kantamneni, S., Ong, E., Mossing, D., Lu, C., Bogdan, P.C., Ameisen, E., Chen, J., Kishylau, D., Pearce, A., Tarng, J., Wu, A., Wu, J., Zhang, Y., Ziegler, D.M., Hubinger, E., Batson, J., Lindsey, J., Zimmerman, S., Marks, S., 2026. Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Transform. Circuits Thread.

Haque, R., Turnbull, O.M., Parsan, A., Parsan, N., Yang, J.J., Beukenhorst, A.L., Deane, C.M., 2025. Mechanistic Interpretability of Antibody Language Models Using SAEs. https://doi.org/10.48550/arXiv.2512.05794

Kästner, L., Crook, B., 2024. Explaining AI through mechanistic interpretability. Eur. J. Philos. Sci. 14, 52. https://doi.org/10.1007/s13194-024-00614-4

Lindsey, J., 2025. Emergent Introspective Awareness in Large Language Models. Transform. Circuits Thread.

Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., Carter, S., 2020. Zoom In: An Introduction to Circuits. Distill. https://doi.org/10.23915/distill.00024.001

Rai, D., Zhou, Y., Feng, S., Saparov, A., Yao, Z., 2025. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. https://doi.org/10.48550/arXiv.2407.02646

Simon, E., Zou, J., 2025. InterPLM: discovering interpretable features in protein language models via sparse autoencoders. Nat. Methods 22, 2107–2117. https://doi.org/10.1038/s41592-025-02836-7

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N.L., McDougall, C., MacDiarmid, M., Freeman, C.D., Sumers, T.R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., Henighan, T., 2024. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transform. Circuits Thread.

Author

Rebonto Haque

Oxford Protein Informatics Group

or "OPIG" to friends
