
Will TurboQuant save us from the RAM apocalypse?

The LLM boom is causing a global shortage of the very computer memory it needs to sustain itself. Reports suggest OpenAI’s Stargate project alone could consume up to 40% of global DRAM output. That puts pressure on frontier labs like Google DeepMind to make their models more memory-efficient.


One such technique is TurboQuant, released by Google. TurboQuant is an example of an online “quantisation” method. LLMs represent information using large tensors of numerical values, where each number typically uses 32 or 16 bits. However, many values do not require full numerical precision, so we can “round” them using fewer bits and less memory. We can see this in the example below:

The rounded value now requires 4x less memory. Source
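
To make the rounding step concrete, here is a minimal sketch in Python/NumPy (the function names and the float32-to-int8 choice are illustrative assumptions, not taken from the TurboQuant paper). It maps a float32 tensor onto 256 integer levels, cutting its memory footprint by 4x in exchange for a small rounding error:

```python
# Illustrative sketch of quantisation as "rounding" to fewer bits.
# Names and bit widths are assumptions, not TurboQuant's actual scheme.
import numpy as np

def quantise_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 values onto 256 integer levels (int8)."""
    scale = float(np.abs(x).max()) / 127.0   # one scale shared by the whole tensor
    q = np.round(x / scale).astype(np.int8)  # "round" each value down to 8 bits
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(1024, 128).astype(np.float32)
q, scale = quantise_int8(x)

print(x.nbytes // q.nbytes)                    # 4 -> the rounded tensor is 4x smaller
print(np.abs(x - dequantise(q, scale)).max())  # the price: a small rounding error
```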

Some quantisation methods are applied offline, before inference begins. TurboQuant is “online” because it compresses the KV cache (the model’s running memory of the conversation, explained below) dynamically during inference.
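
As a rough illustration of the online idea (a hedged sketch only, with made-up class and method names rather than anything from TurboQuant), the cache below quantises each new key/value tensor to 8 bits the moment it is produced, instead of quantising anything ahead of time:

```python
# Sketch of the offline/online distinction, not TurboQuant's implementation.
import numpy as np

class OnlineQuantisedKVCache:
    """Quantises each key/value tensor as soon as it is produced."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # "Online": compress during inference, as each new token's KV arrives,
        # rather than once ahead of time like offline weight quantisation.
        for store, tensor in ((self.keys, k), (self.values, v)):
            scale = max(float(np.abs(tensor).max()) / 127.0, 1e-8)
            store.append((np.round(tensor / scale).astype(np.int8), scale))

    def get(self):
        # Dequantise on read so attention can use the cached history.
        deq = lambda entries: [q.astype(np.float32) * s for q, s in entries]
        return deq(self.keys), deq(self.values)

cache = OnlineQuantisedKVCache()
for _ in range(4):  # pretend the model generates 4 tokens
    cache.append(np.random.randn(8, 64).astype(np.float32),  # new key
                 np.random.randn(8, 64).astype(np.float32))  # new value
keys, values = cache.get()
```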

A good example of a quantised model is the London Underground map, seen below. Transport for London does not show the full geography because that would be hard to read. The map is meant to help people get from the airports and suburbs to the city centre. So the suburbs and airports do not need to be shown in full detail, but the centre still needs to stay fairly true to life.

Comparison between the London Tube map and its real geography

The Tube map works by compressing the data we care less about and preserving the information we care more about. That raises the key question: how does a quantisation method know how to preserve the important parts, while compressing the less useful parts?

Before answering, some background. Transformer-based LLMs store the recent history of a conversation in a context window made up of tokens, and the KV (key-value) cache stores the previously computed key and value tensors so they do not need to be recomputed during generation. This makes token generation faster, but it also quickly blows up GPU RAM usage: as context windows grow longer, KV-cache memory grows with them.

TurboQuant answers the question by splitting KV-cache channels into standard and outlier groups. The less important standard channels are compressed aggressively, while the “outlier” channels that carry more significant information are kept at higher precision. This lets the model maintain output quality while substantially reducing GPU RAM requirements during inference. In the Tube map analogy, the standard channels are the suburbs and the outlier channels are Central London. The diagram below summarises this.

Diagram of the technique’s workflow. Generated by NotebookLM
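
To illustrate the standard/outlier split (again only a sketch under assumed names and bit widths; TurboQuant’s actual channel selection and quantisation scheme are more sophisticated), the snippet below keeps the highest-magnitude channels in float16 and rounds everything else down to 4-bit levels:

```python
# Sketch of a standard/outlier channel split for a KV-cache slice.
# Channel selection, bit widths, and names are illustrative assumptions.
import numpy as np

def split_and_quantise(kv: np.ndarray, n_outliers: int = 8):
    """kv: a (tokens, channels) slice of the KV cache."""
    # Rank channels by magnitude; the largest become the "outlier" group
    # (Central London), the rest the "standard" group (the suburbs).
    channel_norms = np.abs(kv).max(axis=0)
    outlier_idx = np.argsort(channel_norms)[-n_outliers:]
    standard_idx = np.setdiff1d(np.arange(kv.shape[1]), outlier_idx)

    # Outlier channels keep high precision (float16 here).
    outliers = kv[:, outlier_idx].astype(np.float16)

    # Standard channels are compressed aggressively to 4-bit levels (-8..7).
    # A real implementation would pack two 4-bit values per byte; int8 storage
    # is used here only to keep the sketch short.
    standard = kv[:, standard_idx]
    scale = float(np.abs(standard).max()) / 7.0 + 1e-8
    standard_q = np.clip(np.round(standard / scale), -8, 7).astype(np.int8)

    return outliers, standard_q, scale, outlier_idx, standard_idx

kv = np.random.randn(512, 128).astype(np.float32)   # toy KV slice
outliers, standard_q, *_ = split_and_quantise(kv)
print(outliers.shape, outliers.dtype)      # (512, 8) float16  -> detail preserved
print(standard_q.shape, standard_q.dtype)  # (512, 120) int8   -> aggressively rounded
```

The design choice mirrors the Tube map: a handful of channels get the precision budget, and everything else is squashed.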

According to the paper, TurboQuant outperforms competing quantisation methods on the authors’ benchmarks, though it would have been good to see it evaluated on a wider range of LLMs.

So, will TurboQuant save us from the RAM apocalypse? Probably not on its own. However, techniques like it are becoming increasingly important as LLMs grow larger and context windows expand. As AI companies compete to build ever more capable systems, memory efficiency may become just as important as raw compute power.