{"id":14231,"date":"2026-05-14T14:34:10","date_gmt":"2026-05-14T13:34:10","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=14231"},"modified":"2026-05-14T15:08:44","modified_gmt":"2026-05-14T14:08:44","slug":"will-turboquant-save-us-from-the-ram-apocalypse","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2026\/05\/will-turboquant-save-us-from-the-ram-apocalypse\/","title":{"rendered":"Will TurboQuant save us from the RAM apocalypse?"},"content":{"rendered":"\n<p class=\"\">The LLM boom is causing a <a href=\"https:\/\/www.techradar.com\/pro\/the-global-memory-shortage-the-hidden-bottleneck-behind-the-ai-boom\">global shortage<\/a> of the very same computer memory it needs to sustain itself. Reports suggest OpenAI\u2019s Stargate project alone could consume up to <a href=\"https:\/\/www.tomshardware.com\/pc-components\/dram\/openais-stargate-project-to-consume-up-to-40-percent-of-global-dram-output-inks-deal-with-samsung-and-sk-hynix-to-the-tune-of-up-to-900-000-wafers-per-month\">40% of global DRAM output<\/a>. Frontier labs like Google DeepMind need to make their models more memory-efficient.<\/p>\n\n\n\n<p class=\"\">One such technique is <a href=\"https:\/\/research.google\/blog\/turboquant-redefining-ai-efficiency-with-extreme-compression\/\">TurboQuant<\/a>, released by Google. TurboQuant is an example of an online \u201cquantisation\u201d method. LLMs represent information using large tensors of numerical values, where each number typically uses 32 or 16 bits. However, many values do not require full numerical precision, so we can &#8220;round&#8221; them using fewer bits and less memory. 
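<p class=\"\">As a toy illustration of this rounding idea (a hypothetical NumPy sketch, not TurboQuant\u2019s actual algorithm), we can map 32-bit floats onto 8-bit integers with a single \u201cabsmax\u201d scale factor:<\/p>

```python
# Toy quantisation sketch: round 32-bit floats to 8-bit integers using a
# single "absmax" scale factor, then reconstruct the approximate values.
import numpy as np

def quantise_int8(x):
    """Map floats onto the int8 range [-127, 127] with one shared scale."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    """Approximately reconstruct the original floats."""
    return q.astype(np.float32) * scale

x = np.array([0.12, -1.50, 0.03, 0.98], dtype=np.float32)
q, scale = quantise_int8(x)
x_hat = dequantise(q, scale)
print(x.nbytes, "->", q.nbytes, "bytes")  # 16 -> 4 bytes: 4x less memory
```

<p class=\"\">Each value now occupies 8 bits instead of 32, at the cost of a rounding error of at most half the scale factor.<\/p>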
We can see this in the example below:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-3.png?ssl=1\"><img decoding=\"async\" width=\"630\" height=\"650\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-3.png?fit=630%2C650&amp;ssl=1\" alt=\"\" class=\"wp-image-14236\" style=\"aspect-ratio:0.9692196510739328;width:426px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-3.png?w=630&amp;ssl=1 630w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-3.png?resize=291%2C300&amp;ssl=1 291w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-3.png?resize=624%2C644&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><figcaption class=\"wp-element-caption\">The rounded value now requires 4x less memory. <a href=\"https:\/\/www.inferless.com\/learn\/quantization-techniques-demystified-boosting-efficiency-in-large-language-models-llms\">Source<\/a><\/figcaption><\/figure>\n<\/div>\n\n\n<p class=\"\">Some quantisation methods are applied offline, before inference begins; TurboQuant is \u2018online\u2019 because it compresses the KV cache dynamically during inference.<\/p>\n\n\n\n<p class=\"\">A good real-world analogy for quantisation is the London Underground map, seen below. Transport for London does not show the full geography because that would be hard to read. The map is meant to help people get from the airports and suburbs to the city centre, so the suburbs and airports do not need to be shown in full detail, but the centre still needs to stay fairly true to life. 
<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"876\" style=\"aspect-ratio: 1198 \/ 876;\" width=\"1198\" controls src=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/reddit_tube_gif.mp4\"><\/video><figcaption class=\"wp-element-caption\"><a href=\"https:\/\/www.reddit.com\/r\/dataisbeautiful\/comments\/b8ihhr\/comparison_between_the_london_tube_map_and_its\/\">Comparison between the London Tube map and its real geography<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"\">The Tube map works by compressing the data we care less about and preserving the information we care more about. That raises the key question: how does a quantisation method know to preserve the important parts while compressing the less useful ones?<\/p>\n\n\n\n<p class=\"\">TurboQuant answers this by splitting KV-cache channels into standard and outlier groups. Transformer-based LLMs store the recent history of a conversation in a context window made up of tokens, and the KV (key-value) cache stores previously computed key and value tensors so they do not need to be recomputed during generation. This makes token generation faster, but it also quickly inflates GPU RAM usage: as context windows grow longer, KV-cache memory grows with them. TurboQuant compresses the less important channels aggressively while keeping higher precision for \u201coutlier\u201d channels that carry more significant information, allowing models to maintain output quality while using substantially less GPU RAM during inference. In the Tube map analogy, the standard channels are the suburbs, and the outlier channels are Central London. 
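<p class=\"\">To make the standard-vs-outlier idea concrete, here is a hypothetical sketch (the real TurboQuant algorithm is more involved, and these function names are invented): channels with the largest average magnitude keep 8-bit precision, while the rest are compressed to 4 bits:<\/p>

```python
# Hypothetical sketch of mixed-precision KV-cache compression: "outlier"
# channels (largest average magnitude) are quantised to 8 bits, while
# standard channels are compressed more aggressively to 4 bits.
import numpy as np

def split_and_quantise(kv, n_outliers=2):
    """Quantise a (tokens x channels) cache block channel by channel."""
    magnitude = np.abs(kv).mean(axis=0)              # per-channel importance
    outliers = set(int(c) for c in np.argsort(magnitude)[-n_outliers:])
    recon = np.empty_like(kv)
    for c in range(kv.shape[1]):
        levels = 127 if c in outliers else 7         # int8 vs int4 range
        scale = max(float(np.abs(kv[:, c]).max()) / levels, 1e-12)
        recon[:, c] = np.round(kv[:, c] / scale) * scale
    return recon, outliers

rng = np.random.default_rng(0)
kv = rng.normal(size=(16, 8)).astype(np.float32)
kv[:, 3] *= 10.0                                     # channel 3 becomes an outlier
recon, outliers = split_and_quantise(kv)
```

<p class=\"\">Real systems refine this considerably, but the sketch captures the core intuition: spend the memory budget where the information is.<\/p>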
The diagram below summarises this.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?ssl=1\"><img decoding=\"async\" width=\"2560\" height=\"1429\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?fit=625%2C349&amp;ssl=1\" alt=\"\" class=\"wp-image-14239\" style=\"aspect-ratio:1.7908407382091593;width:655px;height:auto\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?w=2560&amp;ssl=1 2560w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?resize=300%2C167&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?resize=1024%2C572&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?resize=768%2C429&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?resize=1536%2C857&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?resize=2048%2C1143&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?resize=624%2C348&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/image-5-scaled.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><figcaption class=\"wp-element-caption\">Diagram of the technique&#8217;s workflow. 
Generated by NotebookLM<\/figcaption><\/figure>\n\n\n\n<p class=\"\">According to their paper, TurboQuant outperforms other quantisation methods, although it would have been good to see evaluations on a wider range of LLMs.<\/p>\n\n\n\n<p class=\"\">So, will TurboQuant save us from the RAM apocalypse? Probably not on its own. However, techniques like it are becoming increasingly important as LLMs grow larger and context windows expand. As AI companies compete to build ever more capable systems, memory efficiency may become just as important as raw compute power.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The LLM boom is causing a global shortage of the very same computer memory it needs to sustain itself. Reports suggest OpenAI\u2019s Stargate project alone could consume up to 40% of global DRAM output. Frontier labs like Google DeepMind need to make their models more memory-efficient. One such technique is TurboQuant, released by Google. TurboQuant [&hellip;]<\/p>\n","protected":false},"author":141,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[632,138,915,189,48],"tags":[],"ppma_author":[934],"class_list":["post-14231","post","type-post","status-publish","format-standard","hentry","category-deep-learning","category-journal-club","category-llms","category-machine-learning","category-publication"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":934,"user_id":141,"is_guest":0,"slug":"kingsley","display_name":"Kingsley 
Oguma","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/72ad4e0e2b92e304830895c929eed951d2938870c920b031594faf157dc2a617?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14231","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/141"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=14231"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14231\/revisions"}],"predecessor-version":[{"id":14276,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14231\/revisions\/14276"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=14231"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=14231"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=14231"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=14231"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}