{"id":13920,"date":"2026-01-28T14:01:43","date_gmt":"2026-01-28T14:01:43","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=13920"},"modified":"2026-02-03T16:25:44","modified_gmt":"2026-02-03T16:25:44","slug":"democratising-the-dark-arts-writing-cuda-kernels-with-claude","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2026\/01\/democratising-the-dark-arts-writing-cuda-kernels-with-claude\/","title":{"rendered":"Democratising the Dark Arts: Writing Triton Kernels with Claude"},"content":{"rendered":"\n<p>Why would you ever want to leave the warm, fuzzy embrace of <code>torch.nn<\/code>? It works, it\u2019s differentiable, and it rarely causes your entire Python session to segfault without a stack trace. The answer usually comes down to the &#8220;Memory Wall.&#8221; Modern deep learning is often less bound by how fast your GPU can do math (FLOPS) and more bound by how fast it can move data around (Memory Bandwidth). When you write a sequence of simple PyTorch operations, something like <code>x = x * 2 + y<\/code> the GPU often reads <code>x<\/code> from memory, multiplies it, writes it back, reads it again to add <code>y<\/code>, and writes it back again. It\u2019s the computational equivalent of making five separate trips to the grocery store because you forgot the eggs, then the milk, then the bread. Writing a custom kernel lets you &#8220;fuse&#8221; these operations. You load the data once, perform a dozen mathematical operations on it while it sits in the ultra-fast chip registers, and write it back once. The performance gains can be massive (often 2x-10x for specific layers).But traditionally, the &#8220;cost&#8221; of accessing those gains, learning C++, understanding warp divergence, and manual memory management, was just too high for most researchers. 
That equation is finally changing.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\">Abstraction Sandwich<\/h2>\n\n\n\n<p>Usually, we try to solve complexity by piling on abstractions. We write in Python, which calls C++, which compiles down to assembly. We accept the overhead because writing assembly is painful. The amazing thing about using LLMs for kernel programming is that they let you punch through the floor of those abstractions without paying the usual cognitive tax. You can stay in your high-level &#8220;Python mindset&#8221;, thinking about tensors, shapes, and operations, while the AI handles the verbose, tricky syntax required to make the GPU do the heavy lifting.<\/p>\n\n\n\n<p>For a long time, writing custom CUDA kernels felt a bit like being a member of a secret society. You needed to understand the GPU architecture down to the thread level and worry about race conditions that would silently corrupt your data. It was the &#8220;Dark Arts&#8221; of deep learning, reserved for the few brave enough to leave the safety of Python.<br>But recently, I\u2019ve found myself going down a rabbit hole that has completely changed how I look at high-performance computing. I\u2019ve been using Claude (specifically via the new terminal integrations and Claude Code workflows) to write <strong>Triton<\/strong> kernels, and honestly? It feels like having a GPU engineer sitting next to you who doesn&#8217;t mind explaining memory coalescing for the fifth time.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">From English to Pointers<\/h2>\n\n\n\n<p>Let\u2019s look at a practical scenario. Say you want to write a simple fused operation that isn\u2019t supported natively by PyTorch, or perhaps a custom activation function. In the past, I would have spent three hours reading the Triton docs and debugging segmentation faults. Now, the workflow looks something like this: you describe the mathematical operation and the memory constraints. 
<\/p>\n\n\n\n<p>&#8220;I need a Triton kernel that performs a fused addition and ReLU activation on two 1D vectors x and y. Handle the block pointers correctly so we don&#8217;t read out of bounds if the vector size isn&#8217;t a multiple of the block size.&#8221; And Claude spits out something that is usually 95% of the way there:<\/p>\n\n\n\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"python\" data-enlighter-theme=\"\" data-enlighter-highlight=\"\" data-enlighter-linenumbers=\"\" data-enlighter-lineoffset=\"\" data-enlighter-title=\"\" data-enlighter-group=\"\">import torch\nimport triton\nimport triton.language as tl\n\n@triton.jit\ndef add_relu_kernel(x_ptr,  # Pointer to the first input vector\n                    y_ptr,  # Pointer to the second input vector\n                    output_ptr,  # Pointer to the output vector\n                    n_elements,  # Size of the vector\n                    BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process\n                    ):\n    # There are multiple 'programs' processing data.\n    # The first thing we do is identify which program we are:\n    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0\n\n    # This program will process inputs that are offset from the initial data.\n    # For instance, if you have a vector of length 256 and block_size of 64, the programs\n    # will start at offsets 0, 64, 128, 192.\n    block_start = pid * BLOCK_SIZE\n    offsets = block_start + tl.arange(0, BLOCK_SIZE)\n\n    # Create a mask to guard memory access operations\n    mask = offsets &lt; n_elements\n\n    # Load x and y from DRAM, masking out any extra elements in case\n    # the input is not a multiple of the block size\n    x = tl.load(x_ptr + offsets, mask=mask)\n    y = tl.load(y_ptr + offsets, mask=mask)\n\n    output = x + y\n    # Apply ReLU\n    output = tl.where(output > 0, output, 0)\n\n    # Write the result back to DRAM\n    tl.store(output_ptr + offsets, output, mask=mask)<\/pre>\n\n\n\n<p>The beauty here isn&#8217;t just the code generation\u2014it&#8217;s the <strong>explanation of intent<\/strong>. Notice how it handles the <code>masking<\/code> logic? That is the classic foot-gun of GPU programming. If you try to read memory index 1025 in a vector of size 1024 because your block size is fixed, you crash. Claude usually anticipates this &#8220;boundary condition&#8221; complexity, treating the code not just as text, but as a physical mapping of data to hardware.<\/p>\n\n\n\n<p>Of course, you still need to verify the output. Benchmarking these kernels against standard PyTorch implementations is essential to ensure you&#8217;re actually getting a speedup (and not just a fancy, slow kernel). But setting up those benchmarking harnesses is also something Claude is surprisingly good at.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Why would you ever want to leave the warm, fuzzy embrace of torch.nn? It works, it\u2019s differentiable, and it rarely causes your entire Python session to segfault without a stack trace. 
The answer usually comes down to the &#8220;Memory Wall.&#8221; Modern deep learning is often less bound by how fast your GPU can do math [&hellip;]<\/p>\n","protected":false},"author":132,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[633,189,227],"tags":[755,648],"ppma_author":[819],"class_list":["post-13920","post","type-post","status-publish","format-standard","hentry","category-ai","category-machine-learning","category-python-code","tag-cuda","tag-technical"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":819,"user_id":132,"is_guest":0,"slug":"marius","display_name":"Marius 
Urbonas","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/f6cfacfef320206092f2d813679b535a9fe97f5f2f1bb339097459a41a352b87?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13920","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/132"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=13920"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13920\/revisions"}],"predecessor-version":[{"id":13961,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13920\/revisions\/13961"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=13920"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=13920"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=13920"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=13920"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}