Why would you ever want to leave the warm, fuzzy embrace of torch.nn? It works, it’s differentiable, and it rarely causes your entire Python session to segfault without a stack trace. The answer usually comes down to the “Memory Wall.” Modern deep learning is often less bound by how fast your GPU can do math (FLOPS) and more bound by how fast it can move data around (memory bandwidth). When you write a sequence of simple PyTorch operations, something like x = x * 2 + y, the GPU often reads x from memory, multiplies it, writes it back, reads it again to add y, and writes it back again. It’s the computational equivalent of making five separate trips to the grocery store because you forgot the eggs, then the milk, then the bread.

Writing a custom kernel lets you “fuse” these operations: you load the data once, perform a dozen mathematical operations on it while it sits in the ultra-fast chip registers, and write it back once. The performance gains can be massive (often 2x-10x for specific layers). But traditionally, the cost of accessing those gains (learning C++, understanding warp divergence, managing memory by hand) was just too high for most researchers. That equation is finally changing.
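To make the fusion point concrete, here is a tiny eager-mode illustration (the example is mine, not from the post): torch.add’s alpha parameter folds the scalar multiply into the addition, collapsing two kernel launches, and two round trips through memory, into one.

```python
import torch

x = torch.randn(1_000_000)
y = torch.randn(1_000_000)

# Two kernels, two passes over the data: one for the multiply, one for the add.
out_naive = x * 2 + y

# One kernel: torch.add(input, other, alpha) computes input + alpha * other,
# so the scale is fused into the addition and the data moves only once.
out_fused = torch.add(y, x, alpha=2)
```

This particular fusion happens to be built into PyTorch; the whole point of custom kernels is handling the cases that aren’t.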
The Abstraction Sandwich
Usually, we try to solve complexity by piling on abstractions. We write in Python, which calls C++, which compiles down to assembly. We accept the overhead because writing assembly is painful. The amazing thing about using LLMs for kernel programming is that they let you punch through the floor of those abstractions without paying the usual cognitive tax. You can stay in your high-level “Python mindset”, thinking about tensors, shapes, and operations, while the AI handles the verbose, tricky syntax required to make the GPU do the heavy lifting.
For a long time, writing custom CUDA kernels felt a bit like being a member of a secret society. You needed to understand the GPU architecture down to the thread level and worry about race conditions that would silently corrupt your data. It was the “Dark Arts” of deep learning, reserved for the few brave enough to leave the safety of Python.
But recently, I’ve found myself going down a rabbit hole that has completely changed how I look at high-performance computing. I’ve been using Claude (specifically via the new terminal integrations and Claude Code workflows) to write Triton kernels, and honestly? It feels like having a GPU engineer sitting next to you who doesn’t mind explaining memory coalescing for the fifth time.
From English to Pointers
Let’s look at a practical scenario. Say you want to write a simple fused operation that isn’t supported natively by PyTorch, or perhaps a custom activation function. In the past, I would have spent three hours reading the Triton docs and debugging segmentation faults. Now, the workflow looks something like this. You describe the mathematical operation and the memory constraints.
“I need a Triton kernel that performs a fused addition and ReLU activation on two 1D vectors x and y. Handle the block pointers correctly so we don’t read out of bounds if the vector size isn’t a multiple of the block size.” And Claude spits out something that is usually 95% of the way there:
```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_relu_kernel(
    x_ptr,  # Pointer to the first input vector
    y_ptr,  # Pointer to the second input vector
    output_ptr,  # Pointer to the output vector
    n_elements,  # Size of the vector
    BLOCK_SIZE: tl.constexpr,  # Number of elements each program should process
):
    # There are multiple 'programs' processing data. The first thing we do is
    # identify which program we are:
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0
    # This program will process inputs that are offset from the initial data.
    # For instance, if you have a vector of length 256 and block_size of 64,
    # the programs will start at offsets 0, 64, 128, 192.
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Create a mask to guard memory access operations
    mask = offsets < n_elements
    # Load x and y from DRAM, masking out any extra elements in case the
    # input size is not a multiple of the block size
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    # Apply ReLU
    output = tl.where(output > 0, output, 0)
    # Write the fused add + ReLU result back to DRAM
    tl.store(output_ptr + offsets, output, mask=mask)
```
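One thing worth knowing is that the kernel is only half the story: Triton kernels are launched over a grid of program instances, so you still need a small host-side wrapper to allocate the output and compute that grid. Here is a minimal launcher sketch; the add_relu wrapper name and the BLOCK_SIZE=1024 choice are mine, and the kernel is repeated in condensed form so the snippet stands alone.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_relu_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Condensed version of the kernel above: masked load, add, ReLU, masked store.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    out = x + y
    out = tl.where(out > 0, out, 0)
    tl.store(output_ptr + offsets, out, mask=mask)

def add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Host-side launcher: allocates the output and picks the launch grid."""
    output = torch.empty_like(x)
    n_elements = x.numel()
    # One program instance per BLOCK_SIZE chunk; triton.cdiv rounds up so the
    # last (partially full) block is covered, relying on the mask in the kernel.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_relu_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
```

Calling add_relu on two CUDA tensors of any length, multiple of 1024 or not, should then match torch.relu(x + y).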
The beauty here isn’t just the code generation; it’s the explanation of intent. Notice how it handles the masking logic? That is the classic foot-gun of GPU programming: if your block size is fixed and you try to read index 1025 of a vector of size 1024, you trigger an illegal memory access (or, worse, silently read garbage). Claude usually anticipates this boundary-condition complexity, treating the code not just as text but as a physical mapping of data to hardware.
Of course, you still need to verify the output. Benchmarking these kernels against standard PyTorch implementations is essential to ensure you’re actually getting a speedup (and not just a fancy, slow kernel). But setting up those benchmarking harnesses is also something Claude is surprisingly good at.
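Triton actually ships a small utility for exactly this: triton.testing.do_bench handles warmup and repeated timing for you. A sketch of the kind of harness described above might look like the following; the benchmark_add_relu function is a hypothetical name of mine, it assumes a CUDA device, and add_relu stands for whatever Python wrapper you have put around your kernel.

```python
import torch
from triton.testing import do_bench

def benchmark_add_relu(add_relu, size=2**24):
    """Compare a custom fused kernel against the eager PyTorch baseline.

    `add_relu` is any callable taking (x, y) and returning relu(x + y).
    """
    x = torch.randn(size, device="cuda")
    y = torch.randn(size, device="cuda")
    # Check correctness before speed: a fast-but-wrong kernel is worthless.
    torch.testing.assert_close(add_relu(x, y), torch.relu(x + y))
    # do_bench runs warmup iterations, then times repeated calls,
    # returning milliseconds per call.
    ms_custom = do_bench(lambda: add_relu(x, y))
    ms_eager = do_bench(lambda: torch.relu(x + y))
    return ms_custom, ms_eager
```

For a memory-bound op like this, don’t expect miracles over eager PyTorch at small sizes; the fusion payoff shows up on large tensors and longer chains of ops.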
