Category Archives: LLMs

Peering Inside the Black Box: A Beginner’s Introduction to Mechanistic Interpretability

Over the last few years, large language models (LLMs) have gone from being curiosities tucked away in research labs to something most of us interact with on a daily basis, whether for drafting emails, debugging code, or simply pondering the meaning of life at 2am. And yet, for all our reliance on these systems, a rather inconvenient truth lingers in the background: nobody, not even the people who built them, can fully explain what is going on inside.

This is where mechanistic interpretability comes in.

In essence, mechanistic interpretability is the approach of explaining complex machine learning systems through the behaviour of their functional units (Kästner and Crook, 2024) by reverse-engineering them into their more elementary computations (Rai et al., 2025). The aim is not simply to know that a model gives the right answer, but to pull apart the underlying machinery and uncover the causal relationships between input and output. Think of it as neuroscience for neural networks, except we can read every neuron at any moment, rewind, replay, and intervene mid-thought.


Will TurboQuant save us from the RAM apocalypse?

The LLM boom is causing a global shortage of the very same computer memory it needs to sustain itself. Reports suggest OpenAI’s Stargate project alone could consume up to 40% of global DRAM output. Frontier labs like Google DeepMind need to make their models more memory-efficient.


One such technique is TurboQuant, released by Google. TurboQuant is an example of an online “quantisation” method. LLMs represent information using large tensors of numerical values, where each number typically uses 32 or 16 bits. However, many values do not require full numerical precision, so we can “round” them using fewer bits and less memory. We can see this in the example below:

The rounded value now requires 4x less memory.
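To make the rounding idea concrete, here is a minimal sketch in plain Python. It assumes a simple symmetric scheme with one shared scale for the whole tensor, which is an illustrative choice and not TurboQuant's actual quantiser; the function names and the sample values are made up for this example.

```python
import struct

def quantise_int8(values):
    """Map floats to signed 8-bit codes using one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clip to the symmetric int8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantise_int8(codes, scale):
    """Recover approximate floats from the 8-bit codes."""
    return [c * scale for c in codes]

weights = [0.82, -1.27, 0.05, 0.4]          # hypothetical 32-bit values
codes, scale = quantise_int8(weights)
restored = dequantise_int8(codes, scale)

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 code.
bytes_fp32 = len(struct.pack(f"{len(weights)}f", *weights))
bytes_int8 = len(struct.pack(f"{len(codes)}b", *codes))
print(bytes_fp32, bytes_int8)
```

Each restored value differs from the original by at most half a quantisation step (`scale / 2`), which is the precision we trade for the 4x saving.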

Some quantisation methods are applied offline before inference begins. TurboQuant is ‘online’ because it compresses the KV cache dynamically during inference.

A good analogy for a quantised model is the London Underground map, seen below. Transport for London does not show the full geography because that would be hard to read. The map is meant to help people get from the airports and suburbs to the city centre, so the suburbs and airports do not need to be shown in full detail, but the centre still needs to stay fairly true to life.

Comparison between the London Tube map and its real geography

The Tube map works by compressing the data we care less about and preserving the information we care more about. That raises the key question: how does a quantisation method know how to preserve the important parts, while compressing the less useful parts?

TurboQuant answers this by splitting KV-cache channels into standard and outlier groups. Transformer-based LLMs store the recent history of a conversation in a context window made up of tokens. The KV (key-value) cache stores previously computed key and value tensors so they do not need to be recomputed during generation. This makes token generation faster, but it also quickly inflates GPU memory usage: as context windows grow longer, the KV cache grows with them.

TurboQuant reduces memory usage by compressing less important channels more aggressively while preserving higher precision for “outlier” channels that contain more significant information. This allows models to maintain output quality while substantially reducing GPU memory requirements during inference. In the Tube map analogy, the standard channels correspond to the suburbs, and the outlier channels correspond to Central London. The diagram below summarises this.

Diagram of the technique’s workflow. Generated by NotebookLM
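The split between standard and outlier channels can be sketched in a few lines of Python. This is only an illustration of mixed-precision quantisation under stated assumptions: the importance rule (largest peak magnitude), the bit widths (2 and 4), the toy cache, and the function names are all hypothetical, and the paper's actual quantiser is more involved than the uniform rounding used here.

```python
def quantise_channel(channel, bits):
    """Uniformly quantise one channel to `bits` bits, then map back to floats."""
    levels = 2 ** (bits - 1) - 1              # 1 for 2 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in channel)
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [max(-levels, min(levels, round(v / scale))) * scale for v in channel]

def mixed_precision(cache, n_outliers, low_bits=2, high_bits=4):
    """cache: list of channels, each a list of floats (one value per token)."""
    # Assumed importance rule: channels with the largest peak magnitude are outliers.
    ranked = sorted(range(len(cache)),
                    key=lambda c: max(abs(v) for v in cache[c]),
                    reverse=True)
    outliers = set(ranked[:n_outliers])
    bits = [high_bits if c in outliers else low_bits for c in range(len(cache))]
    restored = [quantise_channel(ch, b) for ch, b in zip(cache, bits)]
    return restored, bits

# Toy cache: 4 channels x 4 tokens; channel 1 plays the role of "Central London".
cache = [
    [0.10, -0.20, 0.15, 0.05],
    [4.00, -3.50, 2.75, -1.00],
    [0.30, 0.10, -0.25, 0.20],
    [0.05, -0.10, 0.08, -0.02],
]
restored, bits = mixed_precision(cache, n_outliers=1)
avg_bits = sum(bits) / len(bits)              # average storage cost per value
print(bits, avg_bits)
```

Because only one channel in four keeps the higher precision, the average cost per value stays close to the low bit width while the channel carrying the largest values is reconstructed more faithfully.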

According to the TurboQuant paper, the method outperforms the other quantisation methods it was compared against, although it would have been good to see tests on more LLMs.

So, will TurboQuant save us from the RAM apocalypse? Probably not on its own. However, techniques like it are becoming increasingly important as LLMs grow larger and context windows expand. As AI companies compete to build ever more capable systems, memory efficiency may become just as important as raw compute power.

Building a “Second Brain” – A Functional Knowledge Stack with Obsidian

Whilst I always enjoy the acquisition of knowledge, I’ve always struggled with depositing it usefully. From pen-and-paper notes with a 20-colour theme that lost value with each additional colour, to OneNote or iPad GoodNotes emulations of pen and paper, it’s been a constant quest for the optimal note-taking schema. Personally, there are 3 key objectives I need my note-taking to achieve:

  1. It must be digitally compatible and accessible from any device.
  2. It must comfortably handle math and images.
  3. It must be something I look forward to – the software needs to be aesthetically clean, lightweight with none of the chunkiness of Microsoft apps, and highly customisable.

For me, the solution to this was Obsidian, the perhaps more cultified sibling to Notion. Obsidian is a note-taking application that uses markdown with a surprising amount of flexibility, including the ability to partner it with an LLM, which I’ll explore in this blog alongside my vault organisation do-or-dies and favourite customisations.
