A timeline of sampling methods of diffusion models

When approaching the methods used in de-novo protein design, one is quickly confronted with a plethora of overlapping formulations of what looks superficially like “the same thing”. One paper trains an $\boldsymbol{\epsilon}$-prediction network with a simple MSE loss; another trains a score network with a stochastic-differential-equation justification; a third trains a clean-data predictor under yet another schedule. Each formulation carries its own notation, its own variance schedule, and its own sampler. Qualitatively, this zoo of formulations is doing the same thing: it starts from some unstructured noise and iteratively refines it to eventually produce a protein structure similar (but different!) to other proteins we have experimentally determined in the past. What is not immediately obvious to a newcomer is that all of these formulations are historical descendants of a small number of foundational ideas, and that essentially every architectural and algorithmic decision in a modern protein-design diffusion model has a specific paper of origin and a specific motivation for being there.

This post is my attempt to put these formulations onto a single timeline. I trace the trajectory of the field through four foundational works: DDPM (Ho et al., 2020), DDIM (Song et al., 2021a), the score-based SDE unification (Song et al., 2021b), and EDM (Karras et al., 2022), explaining at each step what specific problem with the previous formulation the next paper was attacking and how the new formulation generalises or simplifies the old one. The goal is coherent motivation rather than exhaustive coverage; the reader interested in implementation details is referred to the original papers and the references at the end.

The problem we want to solve

Before diving into any specific method, let us be precise about the problem all of them are trying to solve. As mentioned above, we have a target distribution $p_{\text{data}}$ on $\mathbb{R}^d$ we are interested in sampling from. For our purposes, think of the distribution over plausible protein backbones, side-chain configurations, or full structures. We cannot evaluate $p_{\text{data}}$ analytically and we cannot sample from it directly; what we have is a finite dataset of samples $\{\mathbf{x}^{(i)}\}_{i=1}^{N} \sim p_{\text{data}}$. We want a generative model that produces fresh samples from $p_{\text{data}}$ at inference time.

The strategy that diffusion and score-based models all share is the same: introduce a tractable prior $p_{\text{prior}}$ (typically an isotropic Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$) that we can sample from trivially, and learn a map that transforms samples from $p_{\text{prior}}$ into samples from $p_{\text{data}}$. The methods differ in how they construct this map: as the reverse of a noising Markov chain (DDPM), or as the time-integral of a reverse-time stochastic differential equation (score SDE), with the deterministic probability-flow ODE as a natural sibling. But the goal is identical, and the underlying object (a parameterised, time-dependent transport from a tractable prior to an intractable data distribution) is identical across all of them.

A useful equation to keep in mind throughout is the Gaussian probability path

$$\mathbf{x}_t \;=\; \alpha_t\,\mathbf{x}_0 \;+\; \sigma_t\,\boldsymbol{\epsilon},\qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),$$

shared by all the methods we will discuss. The differences between DDPM, DDIM, VP/VE-SDE, and EDM can largely be read off as different choices of $(\alpha_t, \sigma_t)$ and different ways of turning a learnt regression target back into a sample.
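To make the $(\alpha_t, \sigma_t)$ bookkeeping concrete, here is a minimal sketch (pure Python; the helper names are my own, not from any library) of the two coefficient choices that recur below: the variance-preserving path used by DDPM and the variance-exploding path used by score-based models.

```python
import math

# Hypothetical helpers illustrating the shared Gaussian path
# x_t = alpha_t * x0 + sigma_t * eps for two coefficient choices.

def vp_coefficients(abar_t):
    """Variance-preserving (DDPM): alpha_t = sqrt(abar_t), sigma_t = sqrt(1 - abar_t)."""
    return math.sqrt(abar_t), math.sqrt(1.0 - abar_t)

def ve_coefficients(sigma_t):
    """Variance-exploding (VE-SDE): alpha_t = 1, sigma_t free to grow."""
    return 1.0, sigma_t

def perturb(x0, eps, alpha_t, sigma_t):
    """One realisation of the Gaussian probability path (scalar toy)."""
    return alpha_t * x0 + sigma_t * eps
```

Under the VP choice, $\alpha_t^2 + \sigma_t^2 = 1$ at every $t$, which is exactly the “total variance stays at unity” property of the DDPM path.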

1. DDPM: Denoising as a Hierarchical VAE

The modern diffusion era begins with Ho, Jain & Abbeel’s “Denoising Diffusion Probabilistic Models” ([Ho et al., 2020](#ref-ho2020)), which built on the variational construction of Sohl-Dickstein et al. (2015) and simplified it into a recipe that could be trained reliably at the scale of CIFAR-10 and LSUN. The contribution is conceptually two-part: a clever choice of variational objective that turns generative modelling into a denoising problem, and an empirical finding that a simplified version of that objective trains better than the principled one.

The forward (noising) process

DDPM defines a fixed, parameter-free Markov chain that gradually corrupts a data point $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ into Gaussian noise over $T$ discrete steps. With a variance schedule $\{\beta_t\}_{t=1}^{T}\subset(0,1)$,

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \;=\; \mathcal{N}\!\bigl(\mathbf{x}_t;\sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\,\beta_t\mathbf{I}\bigr).$$

In the standard DDPM the schedule is linear: $\beta_t$ grows linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 2\times 10^{-2}$ over $T = 1000$ steps (Ho et al., 2020).
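As a concrete sketch (assuming nothing beyond the numbers just quoted), the linear schedule and the running product of the per-step “signal survival” factors $1-\beta_t$ can be tabulated in a few lines; that product ends up vanishingly small at $t = T$:

```python
# Toy tabulation of the linear DDPM schedule from the text:
# beta_1 = 1e-4 grows linearly to beta_T = 2e-2 over T = 1000 steps.
T = 1000
beta = [1e-4 + (2e-2 - 1e-4) * t / (T - 1) for t in range(T)]

# Running product of the per-step signal-survival factors (1 - beta_t).
abar = []
prod = 1.0
for b in beta:
    prod *= 1.0 - b
    abar.append(prod)
```

The final entry is of order $10^{-5}$: after 1000 steps essentially no signal survives, which is the forgetting property described above.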

It is worth pausing to understand what this conditional is doing. The new sample $\mathbf{x}_t$ is drawn from a Gaussian centred at $\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}$ with variance $\beta_t \mathbf{I}$. Two observations are crucial here:


  1. The mean of $\mathbf{x}_t$ equals the mean of $\mathbf{x}_{t-1}$ multiplied by $\sqrt{1-\beta_t} < 1$. Iterating this forward, the mean of $\mathbf{x}_t$ conditional on $\mathbf{x}_0$ is $\prod_{s\le t} \sqrt{1-\beta_s}\cdot\mathbf{x}_0$, a product of factors strictly less than one. Such a product converges to zero as $t\to T$. In words: the forward process gradually forgets the starting point and pushes the conditional mean toward the origin.



  2. The variance accumulates from zero towards one. A short recursive calculation (below) shows that if $\mathrm{Var}(\mathbf{x}_0) \le 1$, the conditional variance of $\mathbf{x}_t \mid \mathbf{x}_0$ grows monotonically from $0$ at $t=0$ to a value close to $1$ at $t = T$.


Putting these together: regardless of where $\mathbf{x}_0$ lives, $q(\mathbf{x}_T \mid \mathbf{x}_0)$ is approximately $\mathcal{N}(\mathbf{0}, \mathbf{I})$ for any reasonable $\beta$ schedule. This is the property that makes the whole construction work.

The crucial algebraic fact about Gaussian Markov chains, and what makes diffusion tractable in the first place, is that the marginal at any step $t$ conditioned on $\mathbf{x}_0$ admits a closed form. Setting $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$,

$$q(\mathbf{x}_t \mid \mathbf{x}_0) \;=\; \mathcal{N}\!\bigl(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\,(1-\bar{\alpha}_t)\mathbf{I}\bigr) \;\;\Longleftrightarrow\;\; \mathbf{x}_t \;=\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 \;+\; \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).$$

This is the reparameterisation: rather than simulating $t$ forward steps, we can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ in one go, using a single Gaussian noise draw and a precomputed $\bar{\alpha}_t$.

Intuition: The forward process gradually destroys information by adding noise. The signal $\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0$ is rescaled down as $t$ grows, while the noise $\sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ is rescaled up; together they keep the total variance at unity (when $\mathrm{Var}(\mathbf{x}_0) = 1$). This is what we will later call a variance-preserving path.
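A short numerical check of this one-shot property (a sketch with a toy scalar “data point”; the schedule values are the linear ones quoted earlier): sampling $\mathbf{x}_t$ directly from the closed-form marginal at $t = T$ should give draws that look like standard Gaussian noise, whatever $\mathbf{x}_0$ was.

```python
import math
import random

# Linear schedule and its cumulative product, as in the text.
T = 1000
beta = [1e-4 + (2e-2 - 1e-4) * t / (T - 1) for t in range(T)]
abar = []
prod = 1.0
for b in beta:
    prod *= 1.0 - b
    abar.append(prod)

def q_sample(x0, t, rng):
    """One-shot draw from q(x_t | x_0) via the reparameterisation."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps

# At t = T-1 the draws should be near-standard Gaussian even for x0 = 5.
rng = random.Random(0)
draws = [q_sample(5.0, T - 1, rng) for _ in range(20_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

With the surviving signal fraction $\sqrt{\bar{\alpha}_T}\,\mathbf{x}_0$ below $0.04$, the empirical mean sits near zero and the empirical variance near one.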

It is worth pausing on how nontrivial this is. No matter where you start (a protein structure, an image of a cat, a noisy sketch), applying small Gaussian perturbations enough times always lands you at the same easy-to-sample distribution: an isotropic Gaussian. The forward direction is universal. The generative problem reduces to inverting this “universal contraction”: if we can learn to undo the corruption step-by-step, we can sample new data points by starting at the Gaussian prior and reversing the chain.

A latent-variable view, and why we need a variational bound

Suppose we want to use the reverse direction to draw samples. We define a generative model $p_\theta$ that parameterises a reverse Markov chain,

$$p_\theta(\mathbf{x}_{0:T}) \;=\; p(\mathbf{x}_T)\,\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t),$$

where the prior $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0},\mathbf{I})$ matches the limiting marginal of the forward process, and each reverse transition is a Gaussian with parameters $(\boldsymbol{\mu}_\theta(\mathbf{x}_t,t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t,t))$ predicted by a neural network. To train, we want to maximise the likelihood $p_\theta(\mathbf{x}_0)$ of the observed data. Why work with the joint and not just $p_\theta(\mathbf{x}_0)$ directly? Because we cannot evaluate $p_\theta(\mathbf{x}_0)$ in closed form: it is the marginal of the joint over all intermediate states:

$$p_\theta(\mathbf{x}_0) \;=\; \int p_\theta(\mathbf{x}_{0:T})\,\mathrm{d}\mathbf{x}_{1:T}.$$

The integral on the right is over $T$ high-dimensional latent variables and is intractable. This is the standard situation in latent-variable models, and the standard fix, due to the VAE literature, is to lower-bound the log-likelihood with the evidence lower bound (ELBO). We introduce a variational distribution $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ over the latents, multiply and divide by it inside the integral, and apply Jensen’s inequality to the resulting log of an expectation:

$$\log p_\theta(\mathbf{x}_0) \;=\; \log \int q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)\,\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\,\mathrm{d}\mathbf{x}_{1:T} \;\ge\; \mathbb{E}_{q}\!\left[\log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right].$$

The expectation on the right is the evidence lower bound. Its negative is the variational lower bound loss,

$$\mathcal{L}_{\text{VLB}} \;=\; -\,\mathbb{E}_{q}\!\left[\log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right],$$

which we minimise to push the model joint toward the data joint induced by $q$.

What makes diffusion special compared to a generic VAE is that we fix the variational distribution $q$ to be the known forward process, not learnt, not parameterised, just the simple Gaussian chain we constructed above. This is the conceptual key: a diffusion model is a hierarchical VAE with $T$ latents whose encoder is frozen by design and whose decoder is a learnable reverse Markov chain.

The diagram above summarises the construction we have just walked through: a frozen forward Markov chain qq that destroys structure, and a learnable reverse chain pθp_\theta that reconstructs it.

Decomposing $\mathcal{L}_{\text{VLB}}$

The integral defining $\mathcal{L}_{\text{VLB}}$ is still high-dimensional, but it admits a clean per-step decomposition. Start by expanding the ratio inside the log, using the factorisations of $p_\theta(\mathbf{x}_{0:T})$ and $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$:

$$\log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)} \;=\; \log p(\mathbf{x}_T) + \sum_{t=1}^{T}\log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_t\mid\mathbf{x}_{t-1})}.$$

The forward conditionals $q(\mathbf{x}_t\mid\mathbf{x}_{t-1})$ are not directly comparable to the reverse conditionals $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$: they go in opposite directions. To make them comparable, rewrite $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ for $t \ge 2$ using Bayes’ rule, conditioning everything on $\mathbf{x}_0$:

$$q(\mathbf{x}_t\mid\mathbf{x}_{t-1}) \;=\; q(\mathbf{x}_t\mid\mathbf{x}_{t-1},\mathbf{x}_0) \;=\; \frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,q(\mathbf{x}_t\mid\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}.$$

The first equality uses the Markov property of the forward chain; the second is Bayes. Substituting back and using a telescoping argument on the marginals (full derivation in [Ho et al., 2020, appendix A](#ref-ho2020)), one obtains

$$\mathcal{L}_{\text{VLB}} \;=\; \underbrace{D_{\mathrm{KL}}\!\bigl(q(\mathbf{x}_T\mid\mathbf{x}_0)\,\Vert\,p(\mathbf{x}_T)\bigr)}_{L_T} \;+\; \sum_{t=2}^{T}\underbrace{\mathbb{E}_{q}\!\left[D_{\mathrm{KL}}\!\bigl(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,\Vert\,p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\bigr)\right]}_{L_{t-1}} \;\underbrace{-\,\mathbb{E}_{q}\!\left[\log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)\right]}_{L_0}.$$

This decomposition has a beautiful structure. $L_T$ is essentially constant: it measures how close the forward marginal $q(\mathbf{x}_T\mid\mathbf{x}_0)$ is to the prior $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0},\mathbf{I})$, and is approximately zero by construction. $L_0$ is a final-step log-likelihood that becomes negligible at high $T$. The interesting terms are the per-step KL divergences $L_{t-1}$, which compare the true posterior $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$, a Gaussian with closed form thanks to Bayes’ rule on Gaussians, against the model’s reverse conditional $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$.

The true posterior is

$$q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) \;=\; \mathcal{N}\bigl(\mathbf{x}_{t-1};\,\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0),\,\tilde{\beta}_t\mathbf{I}\bigr),$$

with closed-form posterior mean and variance

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0) \;=\; \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 \;+\; \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t,\qquad \tilde{\beta}_t \;=\; \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t.$$

Here we can notice how the new Gaussian mean is an interpolation between the noisy state $\mathbf{x}_t$ in which we currently sit and the clean, unnoised sample at time zero!
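In code, the posterior parameters are a direct transcription of these formulas (a sketch on a scalar toy; `beta` and `abar` are schedule lists as tabulated earlier, with $\bar{\alpha}_{t-1}$ read as 1 at the first step):

```python
import math

def posterior_mean_var(x_t, x0, t, beta, abar):
    """Closed-form q(x_{t-1} | x_t, x_0): returns (mu_tilde, beta_tilde)."""
    abar_prev = abar[t - 1] if t > 0 else 1.0
    coef_x0 = math.sqrt(abar_prev) * beta[t] / (1.0 - abar[t])
    coef_xt = math.sqrt(1.0 - beta[t]) * (1.0 - abar_prev) / (1.0 - abar[t])
    mu = coef_x0 * x0 + coef_xt * x_t            # interpolation of x0 and x_t
    var = (1.0 - abar_prev) / (1.0 - abar[t]) * beta[t]
    return mu, var

# Sanity demo on the linear schedule from the text.
beta = [1e-4 + (2e-2 - 1e-4) * t / 999 for t in range(1000)]
abar = []
_p = 1.0
for _b in beta:
    _p *= 1.0 - _b
    abar.append(_p)
mu_tilde, beta_tilde = posterior_mean_var(1.0, 1.0, 500, beta, abar)
```

When `x_t` and `x0` agree, the interpolated mean is (very nearly) that common value, and the posterior variance is always smaller than $\beta_t$: conditioning on $\mathbf{x}_0$ shrinks the reverse-step uncertainty.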

From KL to MSE on means

A key fact, originally due to Feller (1949), is that when the forward step $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ is Gaussian with sufficiently small variance $\beta_t$, the exact reverse step $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is also approximately Gaussian. This is the formal justification for parameterising the model’s reverse conditional as a Gaussian with fixed covariance: $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1};\,\boldsymbol{\mu}_\theta(\mathbf{x}_t,t),\,\sigma_t^2\mathbf{I})$ with $\sigma_t^2 \in \{\beta_t, \tilde{\beta}_t\}$. The variance is fixed (not learned) to one of the two endpoints corresponding to the upper and lower bounds on the true reverse variance derived in Ho et al. (2020). The KL divergence between two Gaussians with the same covariance $\sigma_t^2\mathbf{I}$ has a particularly simple form: it reduces to a scaled squared distance between the means:

$$D_{\mathrm{KL}}\!\bigl(\mathcal{N}(\tilde{\boldsymbol{\mu}}_t,\sigma_t^2\mathbf{I})\,\Vert\,\mathcal{N}(\boldsymbol{\mu}_\theta,\sigma_t^2\mathbf{I})\bigr) \;=\; \frac{1}{2\sigma_t^2}\,\lVert\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta\rVert^2.$$

So the entire per-step loss reduces to an MSE between the model’s predicted mean and the true posterior mean. The strategic question is then: how should we parameterise the network’s output to make this MSE easy to learn?

Using the reparameterisation $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$, that is, expressing $\mathbf{x}_0$ in terms of the noise $\boldsymbol{\epsilon}$ that was used to construct $\mathbf{x}_t$, and substituting into $\tilde{\boldsymbol{\mu}}_t$, mechanical algebra gives

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\boldsymbol{\epsilon}) \;=\; \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}\right).$$

This suggests parameterising the model’s mean by the same expression but with a learnt $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ in place of the true $\boldsymbol{\epsilon}$:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t,t) \;=\; \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right).$$

Substituting $\tilde{\boldsymbol{\mu}}_t$ and $\boldsymbol{\mu}_\theta$ into the KL expression and simplifying gives, after cancellation,

$$L_{t-1} \;=\; \frac{\beta_t^2}{2\,\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\mathbb{E}_{\mathbf{x}_0,\boldsymbol{\epsilon}}\!\left[\lVert\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\rVert^2\right].$$

Each per-step VLB term has thus reduced to a weighted MSE between the true noise (drawn when we constructed $\mathbf{x}_t$) and the network’s predicted noise. This is, structurally, vanilla supervised regression: predict $\boldsymbol{\epsilon}$ given $\mathbf{x}_t$.

Ho et al. then made the empirical observation that dropping the time-dependent weight in front of this MSE produces strictly better samples than keeping the principled VLB weight. The resulting simple objective is the workhorse of modern diffusion:

$$\mathcal{L}_{\text{simple}}(\theta) \;=\; \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}}\bigl[\,\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\rVert^2\,\bigr],\qquad \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}.$$

The dropped weighting is largest for small $t$ (mild noise), so removing it effectively upweights the harder high-noise levels, which empirically gives sharper samples (Ho et al., 2020).

This objective is extraordinarily efficient and scalable, and this is, in my view, the single most important reason diffusion models took off so quickly. Look at what training requires at each step: pick a data point $\mathbf{x}_0$ from the training set, sample a timestep $t$ uniformly, sample a single Gaussian noise vector $\boldsymbol{\epsilon}$, form $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ using precomputed $\bar{\alpha}_t$ tables, and regress with one network forward-backward pass. There is no need to simulate the trajectory. There is no MCMC inner loop, no ODE solver in training, no need to even store any state between iterations. The training cost per step is identical to that of a vanilla supervised regressor. This is the property that made diffusion scale from CIFAR-10 to ImageNet to text-conditional generation to protein design without much architectural rethinking.
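The whole training step fits in a few lines. The sketch below uses a scalar toy and a hypothetical `eps_theta` callable standing in for the network (here, a stand-in that always predicts zero); everything else is exactly the recipe just described.

```python
import math
import random

def training_loss_step(x0, eps_theta, beta, abar, rng):
    """One L_simple training step: sample t, sample eps, noise, regress."""
    t = rng.randrange(1, len(abar))              # timestep ~ Uniform
    eps = rng.gauss(0.0, 1.0)                    # single noise draw
    x_t = math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps
    pred = eps_theta(x_t, t)                     # one network forward pass
    return (eps - pred) ** 2                     # unweighted MSE

# Schedule tables, precomputed once (linear schedule from the text).
beta = [1e-4 + (2e-2 - 1e-4) * t / 999 for t in range(1000)]
abar = []
_p = 1.0
for _b in beta:
    _p *= 1.0 - _b
    abar.append(_p)

# Demo: a trivial stand-in predictor that always outputs 0, whose
# expected loss is E[eps^2] = 1 by construction.
rng = random.Random(0)
losses = [training_loss_step(1.0, lambda x, t: 0.0, beta, abar, rng)
          for _ in range(5000)]
```

In a real model `eps_theta` would be a neural network and a backward pass would follow; note that nothing in the loop simulates the forward chain step by step.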

Ancestral sampling

At inference time we want new samples from the trained model, so we must run the Markov chain backward. Start from $\mathbf{x}_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and for $t = T, T-1, \dots, 1$ apply the ancestral sampler:

$$\mathbf{x}_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right) \;+\; \sigma_t\,\mathbf{z},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\;\;\sigma_t^2\in\{\beta_t,\tilde{\beta}_t\}.$$

We can observe here as well how the new, less noisy sample at time $t-1$ is indeed our sample at time $t$ minus the noise that the network recognises in the sample. Heuristically, we are taking a step towards the denoised sample (away from the noise).
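As a sketch of the full loop (scalar toy, stdlib only): with an oracle `eps_theta` that knows the data is concentrated at a single point, ancestral sampling should land very close to that point. The oracle is an assumption for illustration only, not anything a trained network gives you.

```python
import math
import random

def ancestral_sample(eps_theta, beta, abar, rng):
    """Run the reverse chain x_T -> x_0 with the sigma_t^2 = beta_t choice."""
    x = rng.gauss(0.0, 1.0)                      # x_T ~ N(0, 1)
    for t in range(len(beta) - 1, 0, -1):
        mean = (x - beta[t] / math.sqrt(1.0 - abar[t]) * eps_theta(x, t)) \
               / math.sqrt(1.0 - beta[t])        # sqrt(alpha_t) = sqrt(1 - beta_t)
        x = mean + math.sqrt(beta[t]) * rng.gauss(0.0, 1.0)
    return x

# Linear schedule from the text.
beta = [1e-4 + (2e-2 - 1e-4) * t / 999 for t in range(1000)]
abar = []
_p = 1.0
for _b in beta:
    _p *= 1.0 - _b
    abar.append(_p)

# Oracle for "data" concentrated at x0 = 5: the exact eps implied by x_t.
def oracle_eps(x_t, t):
    return (x_t - math.sqrt(abar[t]) * 5.0) / math.sqrt(1.0 - abar[t])

rng = random.Random(1)
sample = ancestral_sample(oracle_eps, beta, abar, rng)
```

The chain starts at pure noise and, step by step, contracts onto the point where the (oracle) denoiser says the data lives.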

A later refinement worth flagging here: [Nichol & Dhariwal (2021)](#ref-nichol2021) showed that a cosine schedule,

$$\bar{\alpha}_t \;=\; \frac{f(t)}{f(0)},\qquad f(t) \;=\; \cos^2\!\Bigl(\tfrac{t/T+s}{1+s}\cdot\tfrac{\pi}{2}\Bigr),\qquad s = 0.008,$$

avoids the pathology of the linear schedule destroying low-resolution signal too quickly; $\bar{\alpha}_t$ stays near 1 for longer in the early steps and then collapses smoothly. They also showed that learning a log-interpolation between $\beta_t$ and $\tilde{\beta}_t$ for the reverse covariance improves log-likelihood. Both choices are now standard.
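For reference, the cosine schedule is a one-liner (a sketch; $t$ runs over $0,\dots,T$):

```python
import math

def abar_cosine(t, T, s=0.008):
    """Cosine schedule of Nichol & Dhariwal: abar_t = f(t) / f(0)."""
    def f(u):
        return math.cos((u / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)
```

At $t = 50$ of 1000 the value is still about 0.99, and it decays monotonically to essentially 0 at $t = T$, unlike the linear schedule, which collapses the signal much earlier.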

DDPM in the wild: the original RFDiffusion sampler

The original RFDiffusion (Watson et al., 2023), one of the models that launched de-novo protein design as we know it today, is in this DDPM lineage. Its inner sampling loop in rfdiffusion/inference/utils.py implements exactly the ancestral update we have just derived. The key helper, get_mu_xt_x0, computes the posterior mean $\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)$ and posterior variance $\tilde{\beta}_t$ from cached $\beta$ and $\bar{\alpha}$ schedules:

```python
def get_mu_xt_x0(xt, px0, t, beta_schedule, alphabar_schedule, eps=1e-6):
    """Given xt, predicted x0 and the timestep t, give mu of x(t-1)."""
    t_idx = t - 1
    # (1) Posterior variance: sigma = ((1 - alphabar_{t-1}) / (1 - alphabar_t)) * beta_t
    #     This is exactly tilde{beta}_t.
    sigma = (
        (1 - alphabar_schedule[t_idx - 1]) / (1 - alphabar_schedule[t_idx])
    ) * beta_schedule[t_idx]
    xt_ca  = xt[:, 1, :]    # C-alpha coordinates at time t
    px0_ca = px0[:, 1, :]   # network's prediction of clean C-alpha coordinates
    # (2) First term of the posterior mean: coefficient on x_0.
    a = (
        (torch.sqrt(alphabar_schedule[t_idx - 1] + eps) * beta_schedule[t_idx])
        / (1 - alphabar_schedule[t_idx])
    ) * px0_ca
    # (3) Second term of the posterior mean: coefficient on x_t,
    #     with sqrt(1 - beta_t) = sqrt(alpha_t).
    b = (
        (torch.sqrt(1 - beta_schedule[t_idx] + eps) * (1 - alphabar_schedule[t_idx - 1]))
        / (1 - alphabar_schedule[t_idx])
    ) * xt_ca
    mu = a + b
    return mu, sigma
```

And the helper that drives one ancestral step, get_next_ca, samples $\mathbf{x}_{t-1}$ from the Gaussian $\mathcal{N}(\tilde{\boldsymbol{\mu}}_t, \tilde{\beta}_t\mathbf{I})$ in one line:

```python
def get_next_ca(xt, px0, t, ..., noise_scale=1.0):
    # Compute posterior mean and variance from the cached schedules.
    mu, sigma = get_mu_xt_x0(xt, px0, t, beta_schedule, alphabar_schedule)
    # (4) Ancestral sampling: x_{t-1} ~ N(mu, sigma * I).
    sampled_crds = torch.normal(mu, torch.sqrt(sigma * noise_scale))
    delta = sampled_crds - xt[:, 1, :]
    ...
    out_crds = xt + delta[:, None, :]
    return out_crds / crd_scale, delta / crd_scale
```

Each numbered line is a line-for-line transcription of an equation from Section 1.

  • Line (1) is the closed-form DDPM posterior variance $\tilde{\beta}_t = \tfrac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$, with alphabar_schedule and beta_schedule precomputed at startup.
  • Lines (2) and (3) are the two terms of the closed-form posterior mean
    $$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) \;=\; \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 \;+\; \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t,$$
    with $\sqrt{\alpha_t}$ written out as $\sqrt{1-\beta_t}$ and $\mathbf{x}_0$ replaced by the network’s prediction px0 (an $\mathbf{x}_0$-parameterisation rather than $\boldsymbol{\epsilon}$-parameterisation, but as we will see in Section 2 the two are linear transformations of each other in this Gaussian setting).
  • Line (4) is the ancestral sampling step itself: draw $\mathbf{x}_{t-1} \sim \mathcal{N}(\tilde{\boldsymbol{\mu}}_t,\,\tilde{\beta}_t\mathbf{I})$, exactly the per-step update from Section 1. The noise_scale knob is a protein-design-specific multiplier that lets the user trade off sample diversity against fidelity to the training distribution; setting noise_scale = 1.0 recovers the textbook DDPM sampler.

One little note to keep in mind here: RFDiffusion uses this Gaussian DDPM update only on the translational (C-alpha) part of the structure: rotations are diffused on $SO(3)$ with an IGSO(3) reverse process (the get_next_frames helper alongside get_next_ca in the same file), because the Gaussian formalism does not transfer cleanly to a curved manifold.

2. DDIM: Non-Markovian Generalisation and Deterministic Sampling

As mentioned above, the DDPM formulation allows for highly efficient training that can be parallelised and therefore scaled. At sampling time, however, the Markov structure of the process forces us to start from the tractable distribution at $T$ and traverse every step of the learned chain in reverse. This meant that even with a perfectly trained model one needed hundreds to thousands of network evaluations to produce a single sample. Song, Meng & Ermon (2021a) therefore asked the following question: does the DDPM training objective actually require the forward process to be Markov?

The answer is no, and the consequence is a sampler that converges in 20–50 steps instead of 1000.

A first observation: noise-prediction is denoised-image-prediction

Before deriving DDIM, it is worth absorbing the equivalence between two natural parameterisations of the network’s output. The forward marginal is

$$\mathbf{x}_t \;=\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 \;+\; \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}.$$

This is a linear relationship between three quantities ($\mathbf{x}_0$, $\mathbf{x}_t$, $\boldsymbol{\epsilon}$). Given any two of them we can solve for the third:

$$\mathbf{x}_0 \;=\; \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_t}},\qquad \boldsymbol{\epsilon} \;=\; \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}}.$$

So a network that takes $\mathbf{x}_t$ as input and predicts $\boldsymbol{\epsilon}$ is equivalent to a network that predicts $\mathbf{x}_0$: each prediction is a fixed linear transformation of the other (with coefficients that depend on $t$ but not on the data). In particular, if the network outputs $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ we can immediately extract a denoised-image estimate

$$\hat{\mathbf{x}}_0(\mathbf{x}_t,t) \;=\; \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)}{\sqrt{\bar{\alpha}_t}}.$$

This $\hat{\mathbf{x}}_0$ is essentially the Tweedie-formula one-step denoiser: the network’s best guess of the clean image given the noisy observation. We will use it heavily in what follows.
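The two conversions are worth having as explicit helpers (a sketch on a scalar toy; the same lines apply elementwise to tensors):

```python
import math

def x0_from_eps(x_t, eps, abar_t):
    """Denoised estimate implied by a noise prediction (Tweedie-style)."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)

def eps_from_x0(x_t, x0, abar_t):
    """Noise estimate implied by a clean-data prediction."""
    return (x_t - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)
```

Either direction recovers the other exactly; this is the sense in which $\boldsymbol{\epsilon}$-prediction and $\mathbf{x}_0$-prediction networks are interchangeable.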

Two equivalent ways to think about a denoising step. Both parameterisations describe exactly the same operation, but they invite different mental pictures.

  • Predict the clean image, then take a small step toward it. Feed $\mathbf{x}_t$ to the network and ask “what does the clean image $\mathbf{x}_0$ look like?” Recall from the discussion of the true posterior mean above that one step of denoising is essentially an interpolation between the noisy state $\mathbf{x}_t$ and the predicted clean state $\hat{\mathbf{x}}_0$, with the interpolation weight controlled by the schedule. So a step of the reverse chain literally drags the sample a little closer to where the network thinks the data lives.
  • Predict the noise, then subtract a small fraction of it. Equivalently, ask “what noise was added to produce $\mathbf{x}_t$?” and remove a small portion of it, taking a step away from the noise direction. The two views are linked by the linear relation $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$, so the two updates are mathematically identical.

The DDPM paper picked $\boldsymbol{\epsilon}$-prediction because the regression target has unit variance by construction (the network is asked to predict an $\mathcal{N}(\mathbf{0}, \mathbf{I})$ vector at every $t$), which makes the network well-conditioned across the entire schedule. Different downstream methods will switch between parameterisations as convenient.

A family of non-Markovian forward processes

The DDPM loss $\mathcal{L}_{\text{simple}}$ depends only on the marginals $q(\mathbf{x}_t\mid\mathbf{x}_0)$: at each step we sample one $\mathbf{x}_t$ from this marginal and regress. It does not depend on the joint $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ in any way that involves the inter-step structure. Song et al. (2021a) leverage this: they consider an entire family of (possibly non-Markovian) forward processes that share the same per-step marginals as DDPM but differ in their posterior conditionals $q_\sigma(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$. Each member of this family corresponds to a valid sampler that re-uses the same trained noise predictor $\boldsymbol{\epsilon}_\theta$, no retraining needed.

The family is indexed by a sequence of stochasticity parameters $\{\sigma_t\}$ and posits

$$q_\sigma(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) \;=\; \mathcal{N}\!\Bigl(\mathbf{x}_{t-1};\,\sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\tfrac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}},\,\sigma_t^2\mathbf{I}\Bigr).$$

The structure of this conditional is worth absorbing. The mean is a sum of two terms:

  • $\sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0$: the signal component at noise level $t-1$, i.e. where $\mathbf{x}_0$ would project under the forward marginal at time $t-1$;
  • $\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\tfrac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}}$: a scaled version of the noise direction $\boldsymbol{\epsilon}$ implied by the pair $(\mathbf{x}_t, \mathbf{x}_0)$.

Plus a Gaussian fluctuation of variance $\sigma_t^2 \mathbf{I}$. Song et al. (2021a) prove that, for any choice of $\{\sigma_t\}$, this family yields the same marginals $q(\mathbf{x}_t \mid \mathbf{x}_0)$ as DDPM, and therefore the same training objective and the same trained network are valid.

The DDIM sampler

Plugging in the network’s prediction $\hat{\mathbf{x}}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta)/\sqrt{\bar{\alpha}_t}$ in place of the true $\mathbf{x}_0$, we obtain the DDIM update in a very transparent form:

$$\mathbf{x}_{t-1} \;=\; \underbrace{\sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0}_{\text{signal at level $t-1$}} \;+\; \underbrace{\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_\theta}_{\text{noise direction, scaled}} \;+\; \underbrace{\sigma_t\,\mathbf{z}}_{\text{fresh randomness}},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).$$

Compare directly with the DDPM ancestral update,

$$\mathbf{x}_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta\right) \;+\; \sigma_t^{\text{DDPM}}\,\mathbf{z},$$

and the structural difference becomes clear. The DDPM update is a local perturbation of $\mathbf{x}_t$ whose coefficients $1/\sqrt{\alpha_t}$ and $\beta_t/\sqrt{1-\bar{\alpha}_t}$ are calibrated for a single $\Delta t = 1$ Markov step; the step size is baked into the formula. The DDIM update instead reconstructs $\mathbf{x}_{t-1}$ from scratch by combining the predicted clean image $\hat{\mathbf{x}}_0$ (rescaled to the target noise level via $\sqrt{\bar{\alpha}_{t-1}}$) with the predicted noise direction (rescaled by $\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}$). The target time index $t-1$ enters the right-hand side only through the cumulative noise level $\bar{\alpha}_{t-1}$; the local step quantities $\beta_t$ and $\alpha_t$ do not appear explicitly at all. (The source time $t$ still enters, of course, through $\hat{\mathbf{x}}_0$ and $\boldsymbol{\epsilon}_\theta$, which depend on $\mathbf{x}_t$ and $\bar{\alpha}_t$.)

The stochasticity parameter σt\sigma_t is conventionally written

σt(η)  =  η1αˉt11αˉt(1αˉtαˉt1),\sigma_t(\eta) \;=\; \eta\,\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\Bigl(1-\tfrac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}\Bigr)},

with η[0,1]\eta\in[0,1] interpolating between two limits:

  • η=1\eta = 1 recovers the DDPM ancestral sampler (up to a covariance choice equivalent to β~t\tilde{\beta}_t);
  • η=0\eta = 0 gives the deterministic DDIM sampler, the workhorse of the modern fast-sampling regime:
xt1  =  αˉt1x^0(xt,t)  +  1αˉt1ϵθ(xt,t).\mathbf{x}_{t-1} \;=\; \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0(\mathbf{x}_t,t) \;+\; \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t).
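The step-size-agnostic structure of this update is easy to verify numerically. Below is a minimal NumPy sketch (the function name `ddim_step` and the oracle noise prediction are mine, not the paper's): with an exact ϵ\boldsymbol{\epsilon}, one direct leap between noise levels lands exactly where two smaller leaps would.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_s, eta=0.0):
    """One DDIM update from cumulative signal level abar_t to abar_s
    (abar_s > abar_t means less noise). eta=0 is the deterministic sampler."""
    # Reconstruct the predicted clean sample from (x_t, eps_pred).
    x0_hat = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Song et al.'s sigma_t(eta), with (abar_t, abar_{t-1}) -> (abar_t, abar_s).
    sigma = eta * np.sqrt((1 - abar_s) / (1 - abar_t) * (1 - abar_t / abar_s))
    # Remix signal and noise at the target level, plus optional fresh randomness.
    z = np.random.randn(*np.shape(x_t)) if eta > 0 else 0.0
    return np.sqrt(abar_s) * x0_hat + np.sqrt(1 - abar_s - sigma**2) * eps_pred + sigma * z

# One-leap versus two-leap consistency with an oracle eps:
x0_true, eps_true = 1.0, 0.5
x_t = np.sqrt(0.1) * x0_true + np.sqrt(0.9) * eps_true       # very noisy state
leap = ddim_step(x_t, eps_true, 0.1, 0.7)                    # jump straight to abar = 0.7
via = ddim_step(ddim_step(x_t, eps_true, 0.1, 0.4), eps_true, 0.4, 0.7)
print(np.isclose(leap, via))  # True
```

In practice ϵθ\boldsymbol{\epsilon}_\theta is only approximately correct and changes along the trajectory, which is what ultimately limits how far one can leap.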

Why DDIM allows fewer sampling steps

It is tempting to attribute DDIM’s step-skipping ability simply to “the trajectory is smooth.” That is part of the story but not the whole picture. Three observations, each with clean intuitive content, together explain the effect.

(i) The update is anchored to a noise-level-agnostic prediction. Look again at the DDIM update: the right-hand side combines x^0\hat{\mathbf{x}}_0 (the network’s guess of the clean image) and ϵθ\boldsymbol{\epsilon}_\theta (the noise it sees) with coefficients αˉs\sqrt{\bar{\alpha}_{s}} and 1αˉsσs2\sqrt{1-\bar{\alpha}_{s}-\sigma_s^2} for any target noise level ss — and in the deterministic (η=0\eta = 0) case, this collapses to the clean form αˉs\sqrt{\bar{\alpha}_{s}} and 1αˉs\sqrt{1-\bar{\alpha}_{s}} displayed above. Crucially, x^0\hat{\mathbf{x}}_0 does not depend on which ss we want to jump to, it just depends on the current xt\mathbf{x}_t. So once the network has produced its best guess of the clean image, the formula can reassemble a sample at any lower noise level by simply remixing signal and noise in the proportions appropriate to that level. We can leap directly from t=1000t = 1000 to t=500t = 500 without visiting any intermediate state. (Song et al. formalise this in Section 4.2 / Appendix C.1 of the paper: the same update applies along any sub-sequence τ{1,,T}\tau \subset \{1,\dots,T\}, replacing (αˉt,αˉt1)(\bar{\alpha}_t, \bar{\alpha}_{t-1}) with (αˉτi,αˉτi1)(\bar{\alpha}_{\tau_i}, \bar{\alpha}_{\tau_{i-1}}).) Contrast this with the DDPM update, whose right-hand side contains βt/1αˉt\beta_t/\sqrt{1-\bar{\alpha}_t}. That coefficient encodes the size of a single small Markov step: it tells the sampler how much progress to make in one Δt=1\Delta t = 1 increment. There is no analogous “leap to ss” knob, because the DDPM step size is hard-coded into the formula itself.

(ii) DDIM is a first-order discretisation of an ODE, not an SDE. In the deterministic η=0\eta = 0 case there is no fresh noise injection at each step, so the only source of error is the truncation error of the ODE solver, i.e. the discrepancy between the discrete update and the continuous probability-flow trajectory we will discuss in Section 3. Truncation error scales smoothly with step size and accumulates additively. SDE discretisations are different: each step injects fresh Gaussian noise whose variance scales with the step size, so larger SDE steps mean noisier samples in a way that no amount of network accuracy can fix. ODE solvers do not suffer from this.

(iii) The probability-flow trajectories tend to be relatively low-curvature. Karras et al. (2022) later showed empirically that the deterministic trajectory from prior to data in diffusion models is approximately linear over much of its length, with most of the curvature concentrated near the data manifold. A nearly-straight trajectory is well-approximated by a few large Euler steps.

Together these three points explain the empirical observation that DDIM with 20–50 steps matches DDPM with 1000 steps with no retraining required. DDIM also gives an invertible encoding from noise to data, which is the substrate of latent-space interpolation, editing, and exact likelihood evaluation via the change-of-variables formula.

Intuition: DDIM keeps the same marginals as DDPM but parameterises the reverse process around x^0\hat{\mathbf{x}}_0, the predicted clean image, rather than around xt\mathbf{x}_t. Each step says “given my best guess of the clean image, where would I be at noise level t1t-1?” This is invariant to step size: the formula is happy to jump to any target noise level. Removing the stochasticity (η=0\eta = 0) on top of this turns sampling into a deterministic trajectory you can traverse with bigger strides.

3. Score-Based Generative Modelling Through SDEs

A few months after DDIM, [Song, Sohl-Dickstein, Kingma, Kumar, Ermon & Poole (2021b)](#ref-song2021b) published the paper that, in retrospect, unified everything that came before. Two parallel threads had been developing in the literature:

  • The DDPM thread I described above, descending from variational hierarchical-VAE constructions and trained with the ϵ\epsilon-prediction MSE.
  • The NCSN / SMLD thread initiated by [Song & Ermon (2019)](#ref-songermon2019), “Noise Conditional Score Network” and “Score Matching with Langevin Dynamics”, which trained a network sθ(x,σ)xlogpσ(x)\mathbf{s}_\theta(\mathbf{x}, \sigma) \approx \nabla_\mathbf{x}\log p_\sigma(\mathbf{x}) to estimate the score function of a noise-corrupted version of the data distribution at many noise scales σ1>σ2>>σL\sigma_1 > \sigma_2 > \dots > \sigma_L. To sample, NCSN ran annealed Langevin dynamics: starting from a wide Gaussian, take some Langevin steps at the largest σ\sigma, anneal to the next smaller σ\sigma, repeat.

The training objective for NCSN was denoising score matching (Hyvärinen, 2005; Vincent, 2011), and the sampler was a discrete-time Langevin chain. Song et al. (2021b) showed that both DDPM and NCSN are discretisations of the same continuous-time stochastic differential equation. This unification opened the door to better samplers, exact likelihood evaluation via the probability-flow ODE, and a host of downstream methods including guidance and inverse-problem solvers.

Forward SDE: setup and three regimes

The forward corruption is now an Itô stochastic differential equation on t[0,T]t \in [0, T]:

dx  =  f(x,t)dt  +  g(t)dw,\mathrm{d}\mathbf{x} \;=\; \mathbf{f}(\mathbf{x},t)\,\mathrm{d}t \;+\; g(t)\,\mathrm{d}\mathbf{w},

where w\mathbf{w} is a standard dd-dimensional Wiener process (Brownian motion), f(x,t)\mathbf{f}(\mathbf{x},t) is the drift coefficient, and g(t)g(t) is the diffusion coefficient. Three concrete choices of (f,g)(\mathbf{f}, g) recover existing models:

(a) Variance Preserving (VP-SDE), the continuous-time limit of DDPM:

dx  =  12β(t)xdt  +  β(t)dw.\mathrm{d}\mathbf{x} \;=\; -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t \;+\; \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}.

This is an inhomogeneous Ornstein–Uhlenbeck (OU) process: a stochastic process whose drift is a linear restoring force pulling x\mathbf{x} toward the origin (with strength 12β(t)\tfrac{1}{2}\beta(t)), and whose diffusion term injects isotropic noise (of strength β(t)\sqrt{\beta(t)}). The “pull toward the origin” can be read directly off the drift coefficient 12β(t)x-\tfrac{1}{2}\beta(t)\,\mathbf{x}: it always points in the direction opposite to the current state x\mathbf{x}, with magnitude proportional to the distance from the origin. So if x\mathbf{x} sits far out along some axis, the deterministic part of the dynamics pushes it back toward zero; if x\mathbf{x} is already near the origin, the drift is small. This is the continuous-time analogue of the discrete forward step xt=1βtxt1+\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \dots from Section 1, where the factor 1βt<1\sqrt{1-\beta_t} < 1 was shrinking the mean toward zero at each step. The asymptotic stationary distribution as tt \to \infty is N(0,I)\mathcal{N}(\mathbf{0}, \mathbf{I}), the Gaussian prior we want.

(b) Variance Exploding (VE-SDE), the continuous-time limit of NCSN / SMLD:

dx  =  dσ2(t)dtdw.\mathrm{d}\mathbf{x} \;=\; \sqrt{\tfrac{\mathrm{d}\,\sigma^2(t)}{\mathrm{d}t}}\,\mathrm{d}\mathbf{w}.

Notice: no drift. The data is never rescaled; only noise is added, at a rate determined by the schedule σ(t)\sigma(t). The marginal is xt=x0+σ(t)ϵ\mathbf{x}_t = \mathbf{x}_0 + \sigma(t)\,\boldsymbol{\epsilon}, and as σ(t)\sigma(t) \to \infty the marginal becomes a very wide Gaussian centred at x0\mathbf{x}_0: the variance literally explodes. The reverse process will therefore look like a huge Gaussian collapsing onto a new generated sample.

(c) sub-VP-SDE, introduced in the same paper:

dx  =  12β(t)xdt  +  β(t)(1e2 ⁣0tβ(s)ds)dw.\mathrm{d}\mathbf{x} \;=\; -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t \;+\; \sqrt{\beta(t)\bigl(1-e^{-2\!\int_0^t\beta(s)\,\mathrm{d}s}\bigr)}\,\mathrm{d}\mathbf{w}.

This is a hybrid with variance strictly less than the VP-SDE at every tt, empirically giving better likelihoods on CIFAR-10. We will not dwell on it but it’s worth knowing it exists.

DDPM is a discretisation of the VP-SDE

The claim that DDPM is the VP-SDE in disguise is worth working out, because the algebra is short and clarifying. Discretise the VP-SDE with the Euler–Maruyama scheme, the simplest first-order numerical scheme for SDEs, using a step Δt\Delta t:

xt+Δtxt    12β(t)xtΔt  +  β(t)Δtϵ,\mathbf{x}_{t+\Delta t} – \mathbf{x}_t \;\approx\; -\tfrac{1}{2}\beta(t)\,\mathbf{x}_t\,\Delta t \;+\; \sqrt{\beta(t)\,\Delta t}\,\boldsymbol{\epsilon},

so

xt+Δt    (112β(t)Δt)xt  +  β(t)Δtϵ.\mathbf{x}_{t+\Delta t} \;\approx\; \bigl(1 – \tfrac{1}{2}\beta(t)\,\Delta t\bigr)\,\mathbf{x}_t \;+\; \sqrt{\beta(t)\,\Delta t}\,\boldsymbol{\epsilon}.

Now identify the discrete DDPM coefficients with the continuous SDE coefficients via βtdisc=β(t)Δt\beta_t^{\text{disc}} = \beta(t)\,\Delta t (i.e. the discrete βt\beta_t in DDPM is β(t)\beta(t) times the time-step). For small βt\beta_t we have the Taylor expansion 1βt112βt\sqrt{1 – \beta_t} \approx 1 – \tfrac{1}{2}\beta_t, so

(112βtdisc)xt    1βtdiscxt.\bigl(1 – \tfrac{1}{2}\beta_t^{\text{disc}}\bigr)\,\mathbf{x}_t \;\approx\; \sqrt{1 – \beta_t^{\text{disc}}}\,\mathbf{x}_t.

The Euler–Maruyama discretisation thus reads

xt+Δt    1βtdiscxt  +  βtdiscϵ,\mathbf{x}_{t+\Delta t} \;\approx\; \sqrt{1 – \beta_t^{\text{disc}}}\,\mathbf{x}_t \;+\; \sqrt{\beta_t^{\text{disc}}}\,\boldsymbol{\epsilon},

which is exactly the DDPM forward step xt=αtxt1+βtϵ\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon} from Section 1. So DDPM is literally the Euler–Maruyama discretisation of the VP-SDE with βtdisc=β(t)Δt\beta_t^{\text{disc}} = \beta(t)\,\Delta t.
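The identification can be checked numerically in two lines (a sketch, not from the paper): the only difference between one Euler–Maruyama step of the VP-SDE and one exact DDPM forward step is the mean coefficient, and the gap closes as O(βt2)O(\beta_t^2).

```python
import numpy as np

# Euler–Maruyama mean coefficient (1 - beta/2) vs exact DDPM coefficient
# sqrt(1 - beta), under the identification beta_disc = beta(t) * dt.
# The noise terms already match exactly: both scale as sqrt(beta_disc).
for beta_disc in (1e-1, 1e-2, 1e-3):
    em_coeff = 1.0 - 0.5 * beta_disc        # Euler–Maruyama mean coefficient
    ddpm_coeff = np.sqrt(1.0 - beta_disc)   # exact DDPM mean coefficient
    print(f"beta={beta_disc:.0e}  gap={em_coeff - ddpm_coeff:.2e}")  # ~ beta^2 / 8
```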

Why is VP-SDE “variance preserving”?

Take the scalar case for clarity and compute the conditional variance recursively. From the forward step xt=1βtxt1+βtϵ\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon} with ϵN(0,1)\boldsymbol{\epsilon}\sim\mathcal{N}(0,1) independent of xt1\mathbf{x}_{t-1}:

Var(xt)  =  (1βt)Var(xt1)  +  βt.\mathrm{Var}(\mathbf{x}_t) \;=\; (1-\beta_t)\,\mathrm{Var}(\mathbf{x}_{t-1}) \;+\; \beta_t.

If Var(xt1)=1\mathrm{Var}(\mathbf{x}_{t-1}) = 1, then Var(xt)=(1βt)1+βt=1\mathrm{Var}(\mathbf{x}_t) = (1-\beta_t)\cdot 1 + \beta_t = 1. So once the variance reaches one, it stays exactly one, independently of the schedule {βt}\{\beta_t\}. This is the precise sense in which the VP-SDE preserves variance: the marginal variance is a fixed point of the forward dynamics at value 1. The forward process keeps every xt\mathbf{x}_t in roughly the same magnitude range as the data, which is convenient numerically.

How VE-SDE is the continuous limit of NCSN/SMLD

Now to the variance-exploding side. NCSN/SMLD as originally constructed (Song & Ermon, 2019) trains a network at LL discrete noise scales σ1>σ2>>σL\sigma_1 > \sigma_2 > \dots > \sigma_L: at scale σi\sigma_i the “data” is x+σiϵ\mathbf{x} + \sigma_i\boldsymbol{\epsilon}, with x\mathbf{x} drawn from the dataset. The network outputs an estimate of the score xlogpσi(x)\nabla_\mathbf{x}\log p_{\sigma_i}(\mathbf{x}) at each scale, where pσp_\sigma denotes the data distribution convolved with N(0,σ2I)\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}).

To pass to the continuous-time version, let the discrete scales become a continuous schedule σ(t)\sigma(t) and ask: which SDE produces marginals xt=x0+σ(t)ϵ\mathbf{x}_t = \mathbf{x}_0 + \sigma(t)\,\boldsymbol{\epsilon} where ϵN(0,I)\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})?

Since the marginal mean is x0\mathbf{x}_0 (constant in tt), the drift must be zero: f(x,t)=0\mathbf{f}(\mathbf{x},t) = 0. The marginal variance is σ2(t)\sigma^2(t), and for an SDE with zero drift and diffusion g(t)g(t), the variance evolves as

ddtVar(xt)  =  g(t)2.\frac{\mathrm{d}}{\mathrm{d}t}\mathrm{Var}(\mathbf{x}_t) \;=\; g(t)^2.

Setting Var(xt)=σ2(t)\mathrm{Var}(\mathbf{x}_t) = \sigma^2(t) gives g(t)2=dσ2(t)dtg(t)^2 = \tfrac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}, hence

g(t)  =  dσ2(t)dt.g(t) \;=\; \sqrt{\frac{\mathrm{d}\,\sigma^2(t)}{\mathrm{d}t}}.

This is exactly the VE-SDE coefficient. The corresponding SDE is

dx  =  dσ2(t)dtdw,\mathrm{d}\mathbf{x} \;=\; \sqrt{\tfrac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}}\,\mathrm{d}\mathbf{w},

with marginal xt=x0+σ(t)ϵ\mathbf{x}_t = \mathbf{x}_0 + \sigma(t)\,\boldsymbol{\epsilon}, exactly the family of noise-corrupted distributions that NCSN trains on. So VE-SDE is the continuous schedule limit of NCSN’s discrete noise-scale construction: just as the VP-SDE is the continuous-time limit of DDPM, the VE-SDE is the continuous-time limit of NCSN.

The term “variance exploding” refers to the divergence of σ2(t)\sigma^2(t) at large tt: in standard NCSN setups σ\sigma is geometric with σmax\sigma_{\max} in the dozens or hundreds, so the marginal xt\mathbf{x}_t at late time has variance much greater than 1. In contrast, VP keeps variance bounded at 1.

Numerical example (VP vs VE at matched SNR)

The cleanest way to see the geometric difference between VP and VE is to match them at the same signal-to-noise ratio (SNR). With σdata=1\sigma_{\text{data}} = 1, the two SNRs are:

SNRVP(t)  =  αˉt1αˉt,SNRVE(σ)  =  1σ2.\mathrm{SNR}_{\text{VP}}(t) \;=\; \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t},\qquad \mathrm{SNR}_{\text{VE}}(\sigma) \;=\; \frac{1}{\sigma^2}.

Setting these equal gives the conversion σVE=(1αˉ)/αˉ\sigma_{\text{VE}} = \sqrt{(1-\bar{\alpha})/\bar{\alpha}}. Picking three noise levels and applying the same x0=1.0\mathbf{x}_0 = 1.0, ϵ=0.5\boldsymbol{\epsilon} = 0.5:

| Noise level (SNR) | αˉ\bar{\alpha} | σVE\sigma_{\text{VE}} | xt\mathbf{x}_t under VP | xt\mathbf{x}_t under VE | Var(xtx0)\mathrm{Var}(\mathbf{x}_t\mid \mathbf{x}_0) VP | Var(xtx0)\mathrm{Var}(\mathbf{x}_t\mid \mathbf{x}_0) VE |
| --- | --- | --- | --- | --- | --- | --- |
| Low (SNR = 9) | 0.9 | 0.333 | 0.91+0.10.5=1.107\sqrt{0.9}\cdot 1 + \sqrt{0.1}\cdot 0.5 = 1.107 | 1+0.3330.5=1.1671 + 0.333\cdot 0.5 = 1.167 | 0.10 | 0.111 |
| Mid (SNR = 1) | 0.5 | 1.000 | 0.51+0.50.5=1.061\sqrt{0.5}\cdot 1 + \sqrt{0.5}\cdot 0.5 = 1.061 | 1+1.0000.5=1.5001 + 1.000\cdot 0.5 = 1.500 | 0.50 | 1.000 |
| High (SNR = 1/9) | 0.1 | 3.000 | 0.11+0.90.5=0.791\sqrt{0.1}\cdot 1 + \sqrt{0.9}\cdot 0.5 = 0.791 | 1+3.0000.5=2.5001 + 3.000\cdot 0.5 = 2.500 | 0.90 | 9.000 |

Reading the table left-to-right: at every row, VP and VE encode the same amount of information about x0\mathbf{x}_0 (same SNR), but they live on completely different scales. VP keeps the magnitude of xt\mathbf{x}_t bounded near 1 across all noise levels. VE lets the magnitude grow without bound: at high noise the data is a vanishing perturbation of a wide Gaussian. Same ϵ\boldsymbol{\epsilon}, same information content, drastically different geometry.
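The rows of the table can be reproduced in a few lines of NumPy (variable names are mine):

```python
import numpy as np

# VP vs VE at matched SNR, with sigma_data = 1: same clean sample x0 and the
# same noise draw eps at every level; only the geometry of the path differs.
x0, eps = 1.0, 0.5
for abar in (0.9, 0.5, 0.1):
    snr = abar / (1.0 - abar)                    # SNR_VP = abar / (1 - abar)
    sigma_ve = np.sqrt((1.0 - abar) / abar)      # matched level: SNR_VE = 1/sigma^2
    x_vp = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    x_ve = x0 + sigma_ve * eps
    print(f"SNR={snr:5.3f}  x_vp={x_vp:.3f}  x_ve={x_ve:.3f}  "
          f"var_vp={1.0 - abar:.2f}  var_ve={sigma_ve**2:.3f}")
```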

Reverse-time SDE and the score function

The fundamental theoretical result of Song et al. (2021b) is the reversibility of the forward SDE. Due to Anderson (1982), every Itô diffusion admits a reverse-time companion whose drift is the original drift minus the score of the time-marginal:

dx  =  [f(x,t)g(t)2 ⁣xlogpt(x)]dt  +  g(t)dwˉ,\mathrm{d}\mathbf{x} \;=\; \bigl[\,\mathbf{f}(\mathbf{x},t) \,-\, g(t)^2\,\nabla_{\!\mathbf{x}}\log p_t(\mathbf{x})\,\bigr]\,\mathrm{d}t \;+\; g(t)\,\mathrm{d}\bar{\mathbf{w}},

where wˉ\bar{\mathbf{w}} is a reverse-time Brownian motion. If we can estimate the score xlogpt(x)\nabla_\mathbf{x}\log p_t(\mathbf{x}) at every tt (call this estimate sθ(x,t)\mathbf{s}_\theta(\mathbf{x},t)), we can simulate the reverse-time SDE numerically (Euler–Maruyama again) and produce a sample from p0pdatap_0 \approx p_{\text{data}} starting from xTpT\mathbf{x}_T\sim p_T.

The score is learned by denoising score matching (Vincent, 2011). The loss is

LDSM(θ)  =  Et,x0,ϵ[λ(t)sθ(xt,t) ⁣xtlogpt(xtx0)2].\mathcal{L}_{\text{DSM}}(\theta) \;=\; \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\Bigl[\lambda(t)\,\bigl\lVert\mathbf{s}_\theta(\mathbf{x}_t,t) – \nabla_{\!\mathbf{x}_t}\log p_t(\mathbf{x}_t\mid\mathbf{x}_0)\bigr\rVert^2\Bigr].

Because pt(xtx0)p_t(\mathbf{x}_t\mid\mathbf{x}_0) is a Gaussian with closed form, the conditional score is analytic. For VP at time tt, pt(xtx0)=N(αˉtx0,(1αˉt)I)p_t(\mathbf{x}_t\mid\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I}), so

 ⁣xtlogpt(xtx0)  =  xtαˉtx01αˉt  =  ϵ1αˉt.\nabla_{\!\mathbf{x}_t}\log p_t(\mathbf{x}_t\mid\mathbf{x}_0) \;=\; -\,\frac{\mathbf{x}_t – \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{1-\bar{\alpha}_t} \;=\; -\,\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar{\alpha}_t}}.

This is the exact mathematical bridge to DDPM: predicting the noise ϵ\boldsymbol{\epsilon} and predicting the score  ⁣xtlogpt\nabla_{\!\mathbf{x}_t}\log p_t are the same regression problem up to a fixed scalar 1/1αˉt-1/\sqrt{1-\bar{\alpha}_t}. The DDPM simple loss and the NCSN denoising score matching loss differ only in the constant in front of the MSE, which is absorbed into the weighting function λ(t)\lambda(t).
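The bridge is a one-liner to check numerically (a sketch; names are mine):

```python
import numpy as np

# For the VP kernel p_t(x_t | x0) = N(sqrt(abar_t) x0, (1 - abar_t) I), the
# conditional score equals -eps / sqrt(1 - abar_t): score prediction and
# eps prediction are the same regression up to a fixed scalar.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
abar_t = 0.3

x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
score = -(x_t - np.sqrt(abar_t) * x0) / (1.0 - abar_t)   # analytic Gaussian score
print(np.allclose(score, -eps / np.sqrt(1.0 - abar_t)))  # True
```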

Probability-flow ODE: same marginals, no noise

A deep observation from Song et al. (2021b): the family of stochastic processes that share the same marginals ptp_t as the reverse SDE is not unique. There is a privileged deterministic member, the probability-flow ODE:

dxdt  =  f(x,t)    12g(t)2 ⁣xlogpt(x).\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} \;=\; \mathbf{f}(\mathbf{x},t) \;-\; \tfrac{1}{2}\,g(t)^2\,\nabla_{\!\mathbf{x}}\log p_t(\mathbf{x}).

Solving this ODE backward in time from xTpT\mathbf{x}_T\sim p_T to x0\mathbf{x}_0 produces a sample with the same marginal distribution at every tt as the reverse SDE, but along a deterministic, invertible trajectory. The probability-flow ODE:

  • underlies DDIM, which is (up to a notational change) its first-order Euler discretisation in the VP regime;
  • enables exact log-likelihoods through the instantaneous change-of-variables formula ddtlogpt(xt)=(dx/dt)\tfrac{\mathrm{d}}{\mathrm{d}t}\log p_t(\mathbf{x}_t) = -\nabla\cdot\bigl(\mathrm{d}\mathbf{x}/\mathrm{d}t\bigr);
  • is the substrate on which EDM will later build.

Intuition (score as force). The score xlogpt(x)\nabla_\mathbf{x}\log p_t(\mathbf{x}) is the gradient of the log-density: it points uphill in ptp_t, toward higher-density regions, i.e. toward the data manifold. In the reverse SDE, the term g(t)2logpt-g(t)^2\,\nabla\log p_t in the drift is therefore an attractive force pulling the sample toward the data manifold. The Brownian term g(t)dwˉg(t)\,\mathrm{d}\bar{\mathbf{w}} is exploratory: it injects random fluctuations that prevent the sample from collapsing onto any particular trajectory. The balance between these two, attraction toward data versus stochastic exploration, is exactly the exploration–exploitation tradeoff familiar from Langevin MCMC and from many sampling algorithms in statistical physics. The probability-flow ODE keeps the attraction and drops the exploration entirely, giving a deterministic trajectory that is the “expected” reverse path.

Predictor–corrector sampling and Langevin correctors

A practical contribution of Song et al. (2021b) that I find under-appreciated is predictor–corrector (PC) sampling. The motivation is that any numerical SDE solver makes errors at each step: the discretised sample at time tΔtt-\Delta t has a distribution that drifts from the true marginal ptΔtp_{t-\Delta t}. Over many steps these errors compound. The idea of PC is to correct the sample’s distribution back onto the true ptΔtp_{t-\Delta t} after each predictor step by running a small number of MCMC steps targeting ptΔtp_{t-\Delta t}, using the same learnt score we already have.

The corrector of choice is Langevin MCMC (a.k.a. unadjusted Langevin algorithm or ULA). To sample from a density pp, Langevin dynamics is the SDE

dx  =  12 ⁣logp(x)dt  +  dw.\mathrm{d}\mathbf{x} \;=\; \tfrac{1}{2}\nabla\!\log p(\mathbf{x})\,\mathrm{d}t \;+\; \mathrm{d}\mathbf{w}.

Its stationary distribution is pp. Discretising with the Euler–Maruyama scheme and a step size δ>0\delta > 0,

x    x+δ2 ⁣logp(x)  +  δz,zN(0,I).\mathbf{x} \;\leftarrow\; \mathbf{x} + \tfrac{\delta}{2}\,\nabla\!\log p(\mathbf{x}) \;+\; \sqrt{\delta}\,\mathbf{z},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).

Here δ\delta is the Langevin step size, a hyperparameter that controls how aggressively the corrector moves toward higher-density regions per Langevin step. (It is not the SDE timestep Δt\Delta t; it is the discretisation parameter of a different SDE that runs at fixed outer time.) A typical PC sampling iteration looks like:

  1. Predictor: One reverse-SDE step from x\mathbf{x} at time tt to xpred\mathbf{x}^{\text{pred}} at time tΔtt – \Delta t (e.g. Euler–Maruyama with the learnt score).
  2. Corrector: KK Langevin steps at fixed time tΔtt – \Delta t targeting ptΔtp_{t-\Delta t}:
    xx+δ2sθ(x,tΔt)+δz.\mathbf{x} \leftarrow \mathbf{x} + \tfrac{\delta}{2}\,\mathbf{s}_\theta(\mathbf{x},\,t-\Delta t) + \sqrt{\delta}\,\mathbf{z}.

The number of Langevin steps KK and the step size δ\delta are tuned per problem; Song et al. (2021b) typically use K=1K = 1 and a small δ\delta, choosing it so that the “signal-to-noise ratio” of the Langevin update matches a target value. PC samplers gave a noticeable quality boost over plain Euler–Maruyama at the same total number of function evaluations, and the principle (interleave deterministic transport with score-driven stochastic correction) reappears in EDM’s stochastic sampler.
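Here is a toy PC sampler, a sketch under strong simplifying assumptions: the data distribution is a 1-D standard Gaussian and the score is used in closed form rather than learnt, so the only errors are discretisation errors. The step-size choices are mine and untuned.

```python
import numpy as np

# Data ~ N(0,1) under a VE path with sigma(t) = t, so p_t = N(0, 1 + t^2)
# and the exact score is score(x, t) = -x / (1 + t^2).
rng = np.random.default_rng(1)

def score(x, t):
    return -x / (1.0 + t * t)

T, n_steps, K = 10.0, 200, 1
ts = np.linspace(T, 1e-3, n_steps + 1)
x = np.sqrt(1.0 + T * T) * rng.standard_normal(100_000)   # x_T ~ p_T

for t, t_next in zip(ts[:-1], ts[1:]):
    dt = t - t_next                                  # size of the backward step
    g2 = 2.0 * t                                     # g(t)^2 = d(sigma^2)/dt
    # Predictor: one reverse-time Euler–Maruyama step of the reverse SDE.
    x = x + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * rng.standard_normal(x.shape)
    # Corrector: K Langevin steps targeting the marginal p_{t_next}.
    delta = 0.1 * (1.0 + t_next * t_next)            # crude per-level step size
    for _ in range(K):
        x = x + 0.5 * delta * score(x, t_next) + np.sqrt(delta) * rng.standard_normal(x.shape)

print(x.var())  # close to 1: the chain lands on the data distribution N(0,1)
```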

Why this looks familiar to anyone who has run a molecular dynamics simulation. Langevin dynamics is exactly the equation of motion used in MD with a stochastic thermostat: under the identification logp(x)U(x)/kBT\log p(\mathbf{x}) \leftrightarrow -U(\mathbf{x})/k_B T (with UU a potential energy and TT a temperature), the term  ⁣logp(x)\nabla\!\log p(\mathbf{x}) becomes the negative gradient of a potential, i.e. a force. So the Langevin step

x    x+δ2 ⁣logp(x)+δz\mathbf{x} \;\leftarrow\; \mathbf{x} + \tfrac{\delta}{2}\,\nabla\!\log p(\mathbf{x}) + \sqrt{\delta}\,\mathbf{z}

reads, in MD language, as “take a small step in the direction of the force, and add a thermal kick.” Just as MD with a Langevin thermostat samples the Boltzmann distribution eU/kBTe^{-U/k_B T} at long times, our corrector samples the target distribution pp at long times. The score sθ(x,t)\mathbf{s}_\theta(\mathbf{x},t) produced by the neural network is, by analogy, a learnt force field: a position-dependent vector field whose flow lines lead the sampler toward high-density regions of ptp_t. The thermal noise δz\sqrt{\delta}\,\mathbf{z} prevents the sample from collapsing into a single energy minimum (a single mode of pp), exactly as it does in MD at finite temperature. This is a useful mental picture when you want to reason about ergodicity, mode mixing, or step-size choices in the sampler: most of the intuition built up in computational statistical mechanics transfers directly.

4. EDM: Elucidating (and Simplifying) the Design Space

By 2022 the diffusion literature had accumulated a small zoo of choices (VP vs VE, ϵ\epsilon– vs score- vs x0\mathbf{x}_0-prediction, linear vs cosine schedule, ancestral vs DDIM vs PC) and it was not clear which combinations mattered for sample quality and which were essentially cosmetic. Karras, Aittala, Aila & Laine (2022), in “Elucidating the Design Space of Diffusion-Based Generative Models” (EDM), attacked this head-on by re-deriving everything in a single, opinionated parameterisation and then ablating each design choice independently. The result is what has become the default in modern image-diffusion and protein-diffusion codebases.

Source. The exposition of the EDM sampler in this section follows closely the excellent talk by the EDM authors on the design space of diffusion samplers. I highly recommend watching it in full: it is the clearest single source I know of for the geometric intuition behind the σ\sigma-parameterisation, preconditioning, and Heun-plus-churn sampling.

The “x-ray view”: decoupling the design choices

The framing the EDM authors themselves use is worth absorbing. Before EDM, methods like VP, VE, iDDPM, and DDIM looked like tightly coupled packages you had to take as a whole: you couldn’t, say, combine the iDDPM noise schedule with the DDIM sampler without something breaking, because each package’s choices were implicitly tangled. EDM gives an x-ray view into the internals: it shows that any such method decomposes into a small set of orthogonal design choices,

  • the ODE / SDE solver (Euler, Heun, RK45, …),
  • the time-step discretisation (where to place the NN sampling points along σ\sigma),
  • the noise schedule σ(t)\sigma(t) (how σ\sigma grows with the abstract time variable),
  • the signal scaling s(t)s(t) (whether and how xt\mathbf{x}_t is rescaled),
  • the network preconditioning (cskip,cout,cin,cnoisec_{\text{skip}}, c_{\text{out}}, c_{\text{in}}, c_{\text{noise}}),
  • the training noise distribution p(σ)p(\sigma),
  • the loss weighting λ(σ)\lambda(\sigma),

each of which can be analysed and tuned in isolation, without invalidating the other choices. Armed with this decomposition, Karras et al. systematically searched for the best individual setting of each axis and stacked the improvements.

Sampling and training can be mixed and matched freely. An immediately useful practical consequence of the decoupling above: the training-time choices (preconditioning, loss weighting, noise distribution) and the sampling-time choices (solver, time-step schedule, churn) live on independent axes. You can take a network trained with the iDDPM noise distribution and weighting, and run it with the EDM Heun sampler at inference, no retraining required. Indeed, much of the empirical force of the EDM paper comes from precisely this experiment: Karras et al. apply their improved sampler to off-the-shelf pre-trained VP, VE, and iDDPM models and obtain large quality gains without touching the weights. The corollary is that sampler design and training design can be iterated on independently, a huge practical win, because re-training a diffusion model is expensive while changing the sampler is essentially free.

Before going through their choices it helps to name the two distinct error sources in sampling, which we want to analyse in isolation:

  1. Network (denoiser) error: even if numerical integration were exact, the denoiser DθD_\theta is only an approximation of the optimal denoiser, so it pushes the trajectory in a slightly wrong direction at each step.
  2. Truncation error: the deterministic trajectory we want to integrate is a curve in (x,σ)(\mathbf{x},\sigma)-space, and any solver discretises it with straight segments. Larger steps mean larger curvature-induced deviations from the true curve.

These two sources interact (worse network → larger drift → harder for the solver), but the sampler design controls primarily the second, while the training design controls primarily the first. EDM separates them cleanly.

The σ\sigma-parameterisation

EDM throws out the discrete time index tt and works directly with the noise level σ\sigma. The forward path is the simplest possible:

x(σ)  =  x0  +  σϵ,ϵN(0,I),\mathbf{x}(\sigma) \;=\; \mathbf{x}_0 \;+\; \sigma\,\boldsymbol{\epsilon},\qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),

i.e. a variance-exploding path with σ\sigma playing both the role of “time” and “noise scale.” Karras et al. pick σ(t)=t\sigma(t) = t so that dσ=dt\mathrm{d}\sigma = \mathrm{d}t, and apply no signal scaling (s(t)=1s(t) = 1). Their justification, which I think is the single deepest insight of the paper, is given in the next subsection.

Two ways to deal with the signal-magnitude problem

A practical concern with the VE path is that for large σ\sigma, the magnitude of x=x0+σϵ\mathbf{x} = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon} becomes huge, values in the tens or hundreds, well outside the comfortable unit-scale range that neural networks like to operate in. Naively feeding x\mathbf{x} at σ=80\sigma = 80 to a CNN is asking for unstable training.

There are exactly two principled cures for this, and they correspond to the two camps in the pre-EDM literature:

Cure 1 (VP-style): rescale the signal inside the forward path. Introduce an explicit signal scaling s(t)s(t) so that the actual sample fed to the network is s(t)x(t)s(t)\mathbf{x}(t), with s(t)s(t) chosen to keep the magnitude of s(t)x(t)s(t)\mathbf{x}(t) bounded. This is what VP does implicitly: setting xt=αˉtx0+1αˉtϵ\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} is exactly a rescaled version of the VE path, with s(t)=αˉts(t) = \sqrt{\bar{\alpha}_t} squeezing the signal down to keep total variance at 1.

Cure 2 (VE + network preconditioning): leave the path alone, rescale at the network’s input and output. Keep the simple VE path x(σ)=x0+σϵ\mathbf{x}(\sigma) = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon} in its raw form, but wrap the neural network so that its input is divided by σ2+σdata2\sqrt{\sigma^2 + \sigma_{\text{data}}^2} (giving unit-variance input) and its output is multiplied by an appropriate σ\sigma-dependent factor. From the network’s point of view it always sees well-scaled tensors, even though the underlying ODE state x\mathbf{x} may have huge magnitude.

Both cures solve the same numerical problem and let the network see well-scaled tensors throughout training and sampling. But they have very different geometric consequences for the underlying ODE trajectory. The VP rescaling distorts the natural flow lines of the ODE: it bends what would be a near-linear trajectory in x\mathbf{x}-space into a curved one. The VE-plus-preconditioning approach leaves the ODE alone (the flow lines retain whatever natural geometry they have) and absorbs the multi-scale problem inside the network. EDM argues, and demonstrates empirically, that the second cure is strictly better: straighter flow lines mean fewer Euler steps are needed for the same quality. This is why EDM commits to s(t)=1s(t) = 1 (no signal scaling in the ODE) and pushes all the magnitude management into preconditioning.
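A quick numerical illustration of Cure 2 (a sketch; only the input side is exercised here, using EDM's cin=1/σ2+σdata2c_{\text{in}} = 1/\sqrt{\sigma^2+\sigma_{\text{data}}^2}):

```python
import numpy as np

# The raw VE state x = x0 + sigma*eps has variance sigma_data^2 + sigma^2,
# huge at sigma = 80, but the preconditioned input c_in(sigma) * x hands the
# network a unit-variance tensor at every noise level.
rng = np.random.default_rng(0)
sigma_data = 1.0
x0 = sigma_data * rng.standard_normal(1_000_000)

for sigma in (0.1, 1.0, 80.0):
    x = x0 + sigma * rng.standard_normal(x0.shape)      # raw VE state
    c_in = 1.0 / np.sqrt(sigma**2 + sigma_data**2)      # EDM input preconditioner
    print(f"sigma={sigma:5.1f}  var(x)={x.var():8.1f}  var(c_in*x)={(c_in * x).var():.3f}")
```

The ODE state itself is never rescaled; the multi-scale problem lives entirely inside the network wrapper.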

Deriving the EDM ODE from the forward path

The probability-flow ODE for a general SDE was given in Section 3. For the VE-SDE with f=0\mathbf{f} = 0 and g(t)=dσ2(t)/dtg(t) = \sqrt{\mathrm{d}\sigma^2(t)/\mathrm{d}t} it becomes

dxdt  =  12dσ2(t)dt ⁣xlogpt(x).\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} \;=\; -\,\tfrac{1}{2}\,\frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}\,\nabla_{\!\mathbf{x}}\log p_t(\mathbf{x}).

Now invoke Tweedie’s formula: for x=x0+σϵ\mathbf{x} = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon} with x0pdata\mathbf{x}_0\sim p_{\text{data}},

E[x0x]  =  x  +  σ2 ⁣xlogpσ(x),\mathbb{E}\bigl[\mathbf{x}_0\,\big|\,\mathbf{x}\bigr] \;=\; \mathbf{x} \;+\; \sigma^2\,\nabla_{\!\mathbf{x}}\log p_\sigma(\mathbf{x}),

equivalently  ⁣xlogpσ(x)=(D(x;σ)x)/σ2\nabla_{\!\mathbf{x}}\log p_\sigma(\mathbf{x}) = (D(\mathbf{x};\sigma) – \mathbf{x})/\sigma^2, where D(x;σ):=E[x0x]D(\mathbf{x};\sigma) := \mathbb{E}[\mathbf{x}_0\mid\mathbf{x}] is the optimal MMSE denoiser. MMSE stands for minimum mean squared error: among all measurable functions f(x)f(\mathbf{x}) predicting x0\mathbf{x}_0 from a noisy observation x\mathbf{x}, the conditional expectation E[x0x]\mathbb{E}[\mathbf{x}_0\mid\mathbf{x}] is the unique minimiser of the expected squared error Ef(x)x02\mathbb{E}\lVert f(\mathbf{x}) – \mathbf{x}_0\rVert^2. So D(x;σ)D(\mathbf{x};\sigma) is precisely the object that a network trained with an MSE loss would converge to in the infinite-data limit: predict the clean image given the noisy one. Tweedie’s formula tells us this denoiser and the score are two views of the same underlying quantity.
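A sanity check of Tweedie's formula in the one case where everything is analytic, Gaussian data (a sketch; names are mine):

```python
import numpy as np

# For x0 ~ N(0, s^2): p_sigma = N(0, s^2 + sigma^2), so the score at x is
# -x/(s^2 + sigma^2) and Tweedie gives D(x; sigma) = x * s^2/(s^2 + sigma^2),
# the classic Gaussian MMSE shrinkage estimator. A Monte-Carlo estimate of
# E[x0 | x] agrees.
rng = np.random.default_rng(0)
s, sigma, x_query = 2.0, 1.5, 0.7

score = -x_query / (s**2 + sigma**2)
d_tweedie = x_query + sigma**2 * score                 # Tweedie denoiser
d_exact = x_query * s**2 / (s**2 + sigma**2)           # posterior mean via conjugacy

x0s = s * rng.standard_normal(2_000_000)
xs = x0s + sigma * rng.standard_normal(x0s.shape)
near = np.abs(xs - x_query) < 0.05                     # condition on x ≈ x_query
d_mc = x0s[near].mean()                                # Monte-Carlo E[x0 | x]
print(d_tweedie, d_exact, d_mc)                        # all ≈ 0.448
```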

Substituting the Tweedie expression and using σ(t)=t\sigma(t) = t so that dσ2/dt=2t\mathrm{d}\sigma^2/\mathrm{d}t = 2t:

dxdt  =  122tD(x;t)xt2  =  xD(x;t)t.\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} \;=\; -\tfrac{1}{2}\cdot 2t\cdot\frac{D(\mathbf{x};t) - \mathbf{x}}{t^2} \;=\; \frac{\mathbf{x} - D(\mathbf{x};t)}{t}.

This is the EDM ODE, the most economical form one can write for the deterministic generative dynamics. The sign convention deserves a careful word, because at first glance the formula looks like it points the wrong way.

First, a clarification that is worth being explicit about. The forward noising introduced at the start of this section, x(σ)=x0+σϵ\mathbf{x}(\sigma) = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon}, is the marginal of the stochastic VE-SDE dx=g(t)dw\mathrm{d}\mathbf{x} = g(t)\,\mathrm{d}\mathbf{w}. The equation dx/dt=(xD)/t\mathrm{d}\mathbf{x}/\mathrm{d}t = (\mathbf{x}-D)/t we just derived is a different equation: it is the probability-flow ODE, a deterministic dynamical system whose only relation to the noising SDE is that it shares the same marginal distributions pσp_\sigma at each noise level. They are not the same trajectory and not the same equation; they only agree on the one-time marginals.

The defining property of the PF-ODE, stated explicitly by Karras et al. (2022) (Section 2), is that integrating it from tat_a to tbt_b, either forward or backward in tt, produces a sample distributed according to the marginal pσ(tb)p_{\sigma(t_b)}. Because the PF-ODE is deterministic, its solutions can be traversed in either direction of tt, and Karras et al. give the intuition in two complementary sentences: “an infinitesimal forward step of this ODE nudges the sample away from the data… equivalently, a backward step nudges the sample towards the data distribution.” So the same PF-ODE can be used either to corrupt (integrate with Δt>0\Delta t > 0, toward larger noise) or to generate (integrate with Δt<0\Delta t < 0, toward smaller noise).

Now interpret the formula in each direction. Read with increasing tt (increasing noise level σ=t\sigma = t), the velocity vector (xD)/t(\mathbf{x}-D)/t points in the direction xD\mathbf{x} - D, i.e. away from the denoiser’s estimate of the clean image — consistent with the corruption story (sample drifts away from the data manifold as σ\sigma grows). To generate samples we integrate with Δt<0\Delta t < 0, so a single Euler step is xx+Δt(xD)/t=xΔt(xD)/t\mathbf{x} \leftarrow \mathbf{x} + \Delta t \cdot (\mathbf{x}-D)/t = \mathbf{x} - |\Delta t|\cdot(\mathbf{x}-D)/t, which is a step in the direction DxD - \mathbf{x}, toward the current best guess of the clean image. The sampler does the intuitive thing.

The apparent singularity at t0t \to 0 also deserves a closer look. The numerator xD(x;t)\mathbf{x} - D(\mathbf{x};t) and the denominator tt both vanish: as t0t \to 0 the optimal denoiser becomes the identity (D(x;0)=xD(\mathbf{x};0) = \mathbf{x}), so xD(x;t)0\mathbf{x} - D(\mathbf{x};t) \to 0. The two zeros race each other and the resulting limit is finite. To see this concretely, expand DD to first order in tt: for small tt, D(x;t)x+t2 ⁣logpt(x)D(\mathbf{x};t) \approx \mathbf{x} + t^2\,\nabla\!\log p_t(\mathbf{x}) (this is just Tweedie’s formula again, rearranged), so

xD(x;t)t    t ⁣logpt(x)  t0  0.\frac{\mathbf{x} - D(\mathbf{x};t)}{t} \;\approx\; -\,t\,\nabla\!\log p_t(\mathbf{x}) \;\xrightarrow[t \to 0]{}\; 0.

The velocity therefore vanishes as t0t \to 0, and the trajectory smoothly settles onto the data manifold rather than blowing up.

The tangent points directly to the denoiser output: why this matters. Suppose at time tt we extrapolate a single linear Euler step from x(t)\mathbf{x}(t) all the way down to t=0t = 0. Plugging into dx/dt=(xD)/t\mathrm{d}\mathbf{x}/\mathrm{d}t = (\mathbf{x} - D)/t,

x(0)    x(t)+(0t)x(t)D(x(t);t)t  =  D(x(t);t).\mathbf{x}(0) \;\approx\; \mathbf{x}(t) + (0 - t)\cdot\frac{\mathbf{x}(t) - D(\mathbf{x}(t);t)}{t} \;=\; D(\mathbf{x}(t);\,t).

So with the σ(t)=t\sigma(t)=t schedule, the tangent direction points exactly to the current denoiser output. This is not true for other schedules: for VP, or for any choice with signal-rescaling s(t)1s(t) \ne 1, the tangent points to some weird intermediate target that depends on the schedule’s curvature. Empirically, the denoiser output D(x(t);t)D(\mathbf{x}(t);t), the model’s best guess of the clean image, changes very slowly along the sampling trajectory. Once you have an approximately right guess of the clean image, that guess remains approximately right as you anneal σ\sigma downward. So the tangent direction is roughly constant across steps, which is exactly the condition under which large Euler steps work well. This is the deepest answer to “why does DDIM-style ODE sampling take such big strides?”: the dynamics really want this schedule.
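Both claims (the tangent landing exactly on the denoiser output, and the near-straightness of the flow lines at high σ\sigma) can be checked in a toy Gaussian case where the optimal denoiser and the exact PF-ODE solution are both analytic. The uniform time grid and the helper names below are my own illustration:

```python
import math

S = 0.5  # sigma_data of a toy 1-D Gaussian data distribution x0 ~ N(0, S^2)

def D(x, t):
    """Analytic MMSE denoiser for the toy: E[x0 | x] at noise level t."""
    return x * S**2 / (S**2 + t**2)

def exact(xT, T, t):
    """Exact solution of the toy PF-ODE dx/dt = (x - D)/t = x*t/(S^2 + t^2)."""
    return xT * math.sqrt((S**2 + t**2) / (S**2 + T**2))

# (a) A single Euler step from t all the way down to 0 lands on D(x; t).
x, t = 2.0, 1.5
one_step = x + (0 - t) * (x - D(x, t)) / t
print(one_step, D(x, t))                 # identical up to rounding

# (b) In the near-linear high-sigma region, a handful of coarse Euler
#     steps already track the exact trajectory closely.
T, t_end, n, xT = 10.0, 1.25, 7, 4.0
ts = [T + i * (t_end - T) / n for i in range(n + 1)]
xc = xT
for ta, tb in zip(ts, ts[1:]):
    xc = xc + (tb - ta) * (xc - D(xc, ta)) / ta
rel_err = abs(xc - exact(xT, T, t_end)) / abs(exact(xT, T, t_end))
print(rel_err)                           # a few percent despite only 7 steps
```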

Preconditioning: deriving the EDM coefficients

The single most empirically impactful contribution of EDM is the explicit preconditioning of the network. Rather than asking a raw network FθF_\theta to output ϵ\boldsymbol{\epsilon} or x0\mathbf{x}_0 across many orders of magnitude of σ\sigma, Karras et al. write

Dθ(x;σ)  =  cskip(σ)x  +  cout(σ)Fθ(cin(σ)x,cnoise(σ)),D_\theta(\mathbf{x};\sigma) \;=\; c_{\text{skip}}(\sigma)\,\mathbf{x} \;+\; c_{\text{out}}(\sigma)\,F_\theta\bigl(c_{\text{in}}(\sigma)\,\mathbf{x},\,c_{\text{noise}}(\sigma)\bigr),

and derive the coefficients cskip,cout,cin,cnoisec_{\text{skip}}, c_{\text{out}}, c_{\text{in}}, c_{\text{noise}} analytically from two simple desiderata: the network’s input should have unit variance, and the network’s training target should have unit variance. Let me walk through the derivation.

Assume x0\mathbf{x}_0 has variance σdata2\sigma_{\text{data}}^2 per coordinate (typically σdata=0.5\sigma_{\text{data}} = 0.5 for normalised images). Then x=x0+σϵ\mathbf{x} = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon} has variance σdata2+σ2\sigma_{\text{data}}^2 + \sigma^2. For the network’s input cin(σ)xc_{\text{in}}(\sigma)\mathbf{x} to have unit variance,

cin(σ)  =  1σ2+σdata2.c_{\text{in}}(\sigma) \;=\; \frac{1}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}.

The training loss is E[λ(σ)Dθ(x;σ)x02]\mathbb{E}\bigl[\lambda(\sigma)\,\lVert D_\theta(\mathbf{x};\sigma) - \mathbf{x}_0\rVert^2\bigr]. Substituting the preconditioned form and rearranging, the network’s effective target is

Fθ(σ)  =  x0cskip(σ)xcout(σ).F_\theta^{\,\star}(\sigma) \;=\; \frac{\mathbf{x}_0 - c_{\text{skip}}(\sigma)\,\mathbf{x}}{c_{\text{out}}(\sigma)}.

We want Var(Fθ)=1\mathrm{Var}(F_\theta^{\,\star}) = 1. Expanding the numerator,

x0cskipx  =  x0cskip(x0+σϵ)  =  (1cskip)x0cskipσϵ,\mathbf{x}_0 - c_{\text{skip}}\mathbf{x} \;=\; \mathbf{x}_0 - c_{\text{skip}}(\mathbf{x}_0 + \sigma\boldsymbol{\epsilon}) \;=\; (1-c_{\text{skip}})\,\mathbf{x}_0 \,-\, c_{\text{skip}}\,\sigma\,\boldsymbol{\epsilon},

with variance (1cskip)2σdata2+cskip2σ2(1-c_{\text{skip}})^2 \sigma_{\text{data}}^2 + c_{\text{skip}}^2 \sigma^2. To minimise the magnitude of this target across σ\sigma (equivalently, to give the network the least extreme regression problem), we choose cskipc_{\text{skip}} to minimise this variance. Setting the derivative with respect to cskipc_{\text{skip}} to zero:

2(1cskip)σdata2+2cskipσ2=0        cskip(σ)  =  σdata2σ2+σdata2.-2(1-c_{\text{skip}})\sigma_{\text{data}}^2 + 2\,c_{\text{skip}}\,\sigma^2 = 0 \;\;\Longrightarrow\;\; c_{\text{skip}}(\sigma) \;=\; \frac{\sigma_{\text{data}}^2}{\sigma^2 + \sigma_{\text{data}}^2}.

Substituting back, the residual variance is

(1cskip)2σdata2+cskip2σ2  =  σ2σdata2σ2+σdata2.(1-c_{\text{skip}})^2 \sigma_{\text{data}}^2 + c_{\text{skip}}^2 \sigma^2 \;=\; \frac{\sigma^2\,\sigma_{\text{data}}^2}{\sigma^2 + \sigma_{\text{data}}^2}.

For unit variance of the network target we need cout2=σ2σdata2/(σ2+σdata2)c_{\text{out}}^2 = \sigma^2\sigma_{\text{data}}^2/(\sigma^2 + \sigma_{\text{data}}^2), hence

cout(σ)  =  σσdataσ2+σdata2.c_{\text{out}}(\sigma) \;=\; \frac{\sigma\cdot\sigma_{\text{data}}}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}.

Finally, cnoisec_{\text{noise}} is the input embedding of the noise level, for which Karras et al. use the heuristic cnoise(σ)=14ln(σ)c_{\text{noise}}(\sigma) = \tfrac{1}{4}\ln(\sigma).
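These four coefficients are only a few lines of code, and both desiderata can be checked by Monte Carlo. The sketch below (a 1-D toy with σdata = 0.5; the function names are my own) draws pairs (x0, x) and confirms that the scaled input and the effective network target both come out with unit variance at every noise level:

```python
import math, random

S_DATA = 0.5

def c_in(sigma, s=S_DATA):   return 1.0 / math.sqrt(sigma**2 + s**2)
def c_skip(sigma, s=S_DATA): return s**2 / (sigma**2 + s**2)
def c_out(sigma, s=S_DATA):  return sigma * s / math.sqrt(sigma**2 + s**2)

def empirical_variances(sigma, s=S_DATA, n=100_000, seed=0):
    """Monte Carlo estimate of Var[c_in * x] and Var[(x0 - c_skip*x)/c_out]."""
    rng = random.Random(seed)
    v_in = v_tgt = 0.0
    for _ in range(n):
        x0 = rng.gauss(0.0, s)
        x = x0 + sigma * rng.gauss(0.0, 1.0)
        v_in += (c_in(sigma) * x) ** 2
        v_tgt += ((x0 - c_skip(sigma) * x) / c_out(sigma)) ** 2
    return v_in / n, v_tgt / n

checks = [empirical_variances(s_) for s_ in (0.01, 0.5, 10.0)]
for sigma, (v_in, v_tgt) in zip((0.01, 0.5, 10.0), checks):
    print(sigma, v_in, v_tgt)            # both close to 1.0 at every sigma
```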

Why this matters in practice: the two extremes. The behaviour of these coefficients at small and large σ\sigma explains why preconditioning is empirically transformative.

  • At small σ\sigma (clean image, σσdata\sigma \ll \sigma_{\text{data}}): cskip1c_{\text{skip}} \approx 1 and coutσc_{\text{out}} \approx \sigma. The architecture essentially copies the input through the skip, and the network’s output is a small correction scaled down by σ\sigma. If the network makes an error, that error is multiplied by a small number on its way out, downscaled, not amplified. The network is freed from having to learn the identity function (a real risk in naive setups) and can focus on producing a small denoising correction.
  • At large σ\sigma (mostly noise, σσdata\sigma \gg \sigma_{\text{data}}): cskip0c_{\text{skip}} \approx 0 and coutσdatac_{\text{out}} \approx \sigma_{\text{data}}. The skip is effectively disabled, and the network is asked to predict the clean signal directly. This avoids the disaster of a naive noise-prediction setup, where the network would predict ϵ-\boldsymbol{\epsilon} and have its output multiplied by σ\sigma, meaning a network error of order ee becomes an output error of order σe\sigma e, amplified by a potentially huge factor when σ\sigma is in the dozens.

The analytic cskipc_{\text{skip}} smoothly interpolates between these two regimes, automatically selecting “predict residual / predict signal” depending on how noisy the input is. The transition happens around σσdata\sigma \sim \sigma_{\text{data}}. The neural network always sees well-scaled inputs and produces well-scaled outputs; the brunt of the multi-scale problem is absorbed by an analytic skip connection.

A further upside the EDM authors emphasise: because the preconditioning analytically handles the scale variation, the architecture of the inner network FθF_\theta is now essentially free to be whatever you want (DDPM-style U-Net, NCSN-style, transformer, etc.) and the preconditioning works for “any” architecture. EDM cleanly separates “what the network is” from “how its inputs and outputs are scaled.”

The EDM sampler: Heun with stochastic churn

For sampling, EDM proposes a deterministic and a stochastic variant of the same core procedure. The deterministic version is Heun’s second-order ODE solver applied to the EDM ODE: for each step from tit_i to ti+1t_{i+1},

di  =  xiDθ(xi;ti)ti,x~i+1  =  xi+(ti+1ti)di,\mathbf{d}_i \;=\; \frac{\mathbf{x}_i - D_\theta(\mathbf{x}_i; t_i)}{t_i},\qquad \tilde{\mathbf{x}}_{i+1} \;=\; \mathbf{x}_i + (t_{i+1}-t_i)\,\mathbf{d}_i,
di+1  =  x~i+1Dθ(x~i+1;ti+1)ti+1,xi+1  =  xi+(ti+1ti)12(di+di+1).\mathbf{d}_{i+1}' \;=\; \frac{\tilde{\mathbf{x}}_{i+1} - D_\theta(\tilde{\mathbf{x}}_{i+1};t_{i+1})}{t_{i+1}},\qquad \mathbf{x}_{i+1} \;=\; \mathbf{x}_i + (t_{i+1}-t_i)\cdot\tfrac{1}{2}\bigl(\mathbf{d}_i + \mathbf{d}_{i+1}'\bigr).

Heun is a predictor–corrector scheme in the numerical-analysis sense: take a tentative Euler step, evaluate the slope at the predicted endpoint, then re-step using the average of the two slopes. The trapezoidal correction buys an extra order of accuracy over plain Euler (or first-order DDIM): the local truncation error is third-order rather than second-order in the step size, at the cost of one extra denoiser evaluation per step. Karras et al. tested various higher-order solvers (RK45, linear multistep methods) and reported that Heun strikes the best quality-per-NFE tradeoff.
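A minimal sketch of the deterministic Heun sampler, run against the analytic toy Gaussian denoiser so its accuracy can be compared head-to-head with plain Euler against the exact solution. The geometric σ\sigma-spacing and the helper names are my own simplification (EDM itself pairs Heun with the ρ\rho-schedule discussed below):

```python
import math

S = 0.5  # sigma_data of a toy Gaussian data distribution (stand-in for D_theta)

def D(x, t):
    return x * S**2 / (S**2 + t**2)      # analytic MMSE denoiser for the toy

def exact(xT, T, t):
    return xT * math.sqrt((S**2 + t**2) / (S**2 + T**2))

def heun_step(x, ta, tb):
    d1 = (x - D(x, ta)) / ta             # slope at the current point
    x_pred = x + (tb - ta) * d1          # Euler predictor
    if tb == 0:
        return x_pred                    # EDM falls back to Euler on the last step
    d2 = (x_pred - D(x_pred, tb)) / tb   # slope at the predicted endpoint
    return x + (tb - ta) * 0.5 * (d1 + d2)

T, t_end, n, xT = 10.0, 0.5, 7, 3.0
ts = [T * (t_end / T) ** (i / n) for i in range(n + 1)]   # geometric spacing
x_eul = x_heun = xT
for ta, tb in zip(ts, ts[1:]):
    x_eul = x_eul + (tb - ta) * (x_eul - D(x_eul, ta)) / ta
    x_heun = heun_step(x_heun, ta, tb)
ref = exact(xT, T, t_end)
print(abs(x_eul - ref), abs(x_heun - ref))   # Heun's error is noticeably smaller
```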

The stochastic variant adds Langevin-style churn: at the start of each step, inflate the current sample’s noise level by a small factor, then take a deterministic Heun step from the inflated level back down to ti+1t_{i+1}. Concretely:

γi=min ⁣(Schurn/N,21) if SmintiSmax, else 0,\gamma_i = \min\!\bigl(S_{\text{churn}}/N,\,\sqrt{2}-1\bigr) \text{ if } S_{\text{min}}\le t_i\le S_{\text{max}}, \text{ else } 0,
t^i=ti(1+γi),x^i=xi+t^i2ti2Snoiseϵ,\hat{t}_i = t_i(1+\gamma_i),\qquad \hat{\mathbf{x}}_i = \mathbf{x}_i + \sqrt{\hat{t}_i^2 - t_i^2}\cdot S_{\text{noise}}\cdot\boldsymbol{\epsilon},

followed by a Heun step from (x^i,t^i)(\hat{\mathbf{x}}_i, \hat{t}_i) to (xi+1,ti+1)(\mathbf{x}_{i+1}, t_{i+1}). The hyperparameters (Schurn,Smin,Smax,Snoise)(S_{\text{churn}}, S_{\text{min}}, S_{\text{max}}, S_{\text{noise}}) control how much, where in the schedule, and with what amplification noise is reinjected. The interpretation is exactly the predictor–corrector logic from Section 3 in a different disguise: the noise injection is the “Langevin exploration” that washes out deterministic drift error; the Heun step is the deterministic transport that follows the ODE flow lines. Putting them together at each step is empirically more controllable than running a full reverse-SDE solver.
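A single churn-plus-step iteration is compact enough to write out. The sketch below uses a constant γ\gamma and a plain Euler move after the noise injection (EDM itself uses Heun and only churns inside the [Smin,Smax][S_{\text{min}}, S_{\text{max}}] window); the function and parameter names are my own:

```python
import math, random

def churn_step(x, t_cur, t_next, denoise, gamma=0.05, s_noise=1.0, rng=random):
    """One stochastic sampling step: inflate the noise level from t_cur to
    t_hat, inject matching Gaussian noise, then take a deterministic Euler
    step along (x - D)/t down to t_next."""
    t_hat = t_cur * (1 + gamma)
    x_hat = x + math.sqrt(t_hat**2 - t_cur**2) * s_noise * rng.gauss(0.0, 1.0)
    d = (x_hat - denoise(x_hat, t_hat)) / t_hat
    return x_hat + (t_next - t_hat) * d

S = 0.5
def toy_D(x, t):                          # analytic denoiser for Gaussian data
    return x * S**2 / (S**2 + t**2)

# With gamma = 0 the step reduces exactly to deterministic Euler.
euler = 2.0 + (0.8 - 1.0) * (2.0 - toy_D(2.0, 1.0)) / 1.0
print(churn_step(2.0, 1.0, 0.8, toy_D, gamma=0.0), euler)

# A full stochastic chain from sigma = 10 down to 0.05 (geometric grid).
rng = random.Random(0)
ts = [10.0 * (0.05 / 10.0) ** (i / 12) for i in range(13)]
x = 8.0
for ta, tb in zip(ts, ts[1:]):
    x = churn_step(x, ta, tb, toy_D, rng=rng)
print(x)                                  # ends at data scale, far from x = 8.0
```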

The Karras σ\sigma-schedule: spend more steps near the data

The noise levels themselves are not linear or geometric but follow a power law parameterised by ρ\rho:

σi  =  (σmax1/ρ+iN1(σmin1/ρσmax1/ρ))ρ,i=0,,N1.\sigma_i \;=\; \Bigl(\sigma_{\max}^{1/\rho} + \tfrac{i}{N-1}\bigl(\sigma_{\min}^{1/\rho} - \sigma_{\max}^{1/\rho}\bigr)\Bigr)^{\rho},\qquad i = 0,\dots,N-1.

The defaults σmin=0.002\sigma_{\min} = 0.002, σmax=80\sigma_{\max} = 80, σdata=0.5\sigma_{\text{data}} = 0.5, and ρ=7\rho = 7 allocate disproportionately many steps to the low-σ\sigma end, where the sample is close to the denoised image. The motivation is twofold:

  1. Close to the data manifold, the ODE trajectory has its highest curvature. Far from the data the dynamics are gentle and approximately linear (recall that the EDM tangent points to the denoiser output, which barely changes). Close to the data the trajectory “locks onto” a specific sample, and the denoiser output begins to commit to fine details; this is where the flow lines turn. Linear Euler steps are most error-prone in regions of high curvature, so we want fine resolution there.
  2. Small errors near the data are perceptually catastrophic. An error of magnitude ee at small σ\sigma produces a visible artifact in the final sample (low σ\sigma means xt\mathbf{x}_t is almost the final image, and any deviation will appear in the output as-is). The same error at large σ\sigma gets washed out by subsequent denoising. Putting more steps where they matter gives much better quality per NFE.

Increasing ρ\rho makes the allocation more lopsided toward small σ\sigma. The default ρ=7\rho = 7 was found by ablation; “Karras sigmas” with this schedule are now the default in essentially every modern image-diffusion pipeline.
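The schedule itself is a one-liner. The following sketch (my own transcription of the formula above, with hypothetical names) makes the lopsidedness visible by printing the levels:

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels sigma_0 = sigma_max > ... > sigma_{n-1} = sigma_min,
    spaced linearly in sigma^(1/rho) and mapped back through the rho-th power."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

s = karras_sigmas(10)
print([round(v, 4) for v in s])
# The first gap dwarfs the last: almost all the resolution sits near sigma_min.
print(s[0] - s[1], s[-2] - s[-1])
```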

Training: loss weighting and the noise level distribution

EDM’s training prescription is short but important enough that it deserves explicit treatment. Two choices complement the architectural preconditioning.

The noise level distribution during training is log-normal:

ln(σ)    N(Pmean,Pstd2),Pmean=1.2,  Pstd=1.2.\ln(\sigma) \;\sim\; \mathcal{N}(P_{\text{mean}},\,P_{\text{std}}^2),\qquad P_{\text{mean}} = -1.2,\;P_{\text{std}} = 1.2.

The motivation here is the mirror image of the sampling-schedule argument above. Empirically, the per-σ\sigma training loss has a roughly U-shaped profile: at very low σ\sigma (the input is almost clean) the denoising task is trivial (any reasonable network produces a near-identity mapping) and there is little gradient signal to learn from. At very high σ\sigma (the input is mostly noise) the denoising task is essentially hopeless (no signal to recover) and gradients are again uninformative. The sweet spot, where the network can actually make progress and gradients carry real information, sits in the middle of the σ\sigma range. The log-normal distribution above concentrates training samples in exactly this regime, putting computational effort where the model can actually learn something.
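Sampling training noise levels under this prescription is a two-liner; the quantile check below (my own illustration) shows how strongly the draws concentrate around exp(Pmean)0.30\exp(P_{\text{mean}}) \approx 0.30:

```python
import math, random

def sample_training_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Draw one training noise level with ln(sigma) ~ N(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))

rng = random.Random(0)
draws = sorted(sample_training_sigma(rng) for _ in range(100_000))
quantile = lambda p: draws[int(p * len(draws))]
print(quantile(0.25), quantile(0.50), quantile(0.75))
# The central half of the draws falls roughly between 0.13 and 0.68,
# far from both extremes of the sampling schedule (0.002 and 80).
```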

Note that the sampling and training schedules answer different questions. Sampling allocates more steps near small σ\sigma because that is where truncation error is most damaging. Training allocates more samples near intermediate σ\sigma because that is where gradient information is richest. The two need not coincide, and in EDM they don’t.

The loss weighting that complements this is

λ(σ)  =  σ2+σdata2(σσdata)2.\lambda(\sigma) \;=\; \frac{\sigma^2 + \sigma_{\text{data}}^2}{(\sigma\cdot\sigma_{\text{data}})^2}.

This is chosen to equalise the gradient magnitude across noise levels. Without proper weighting, a naive Fθtarget2\lVert F_\theta – \text{target}\rVert^2 loss has gradients that vary by orders of magnitude across σ\sigma, leading to highly imbalanced training: most updates push the network gently while occasional high-magnitude updates dominate. The weighting λ(σ)\lambda(\sigma), combined with the unit-variance preconditioning, produces gradient updates of comparable magnitude at every σ\sigma, which empirically gives much more stable training dynamics.
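There is a tidy identity hiding here: λ(σ)\lambda(\sigma) is exactly 1/cout(σ)21/c_{\text{out}}(\sigma)^2, so the weighted loss on DθD_\theta is the unweighted, unit-variance loss on the inner network FθF_\theta. A quick numeric check (toy helper names, σdata = 0.5):

```python
S_DATA = 0.5

def c_out_sq(sigma, s=S_DATA):
    return sigma**2 * s**2 / (sigma**2 + s**2)

def loss_weight(sigma, s=S_DATA):
    return (sigma**2 + s**2) / (sigma * s) ** 2

# lambda(sigma) * c_out(sigma)^2 == 1 at every noise level: the weighting
# exactly undoes the output scaling, leaving a unit-weight loss on F_theta.
products = [loss_weight(s_) * c_out_sq(s_) for s_ in (0.002, 0.1, 0.5, 10.0, 80.0)]
print(products)
```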

An important observation on stochasticity, paraphrased from the EDM authors’ talk. The stochastic-churn sampler is empirically helpful, but it is a double-edged sword. The Langevin churn introduces its own discretisation error on top of the deterministic Heun step, so the net benefit depends on a delicate balance, and the optimal churn hyperparameters (Schurn,Smin,Smax,Snoise)(S_{\text{churn}}, S_{\text{min}}, S_{\text{max}}, S_{\text{noise}}) need to be tuned per dataset and per architecture, with no clean principles for how to do this. Worse, churn can mask bugs and modelling errors: a model with a slight bias in its denoiser can still produce reasonable samples under enough exploration, even when its deterministic samples reveal that something is wrong.

The authors’ recommendation, which I have come to follow in my own protein-design work, is: develop and debug in fully deterministic mode (Heun on the PF-ODE), and add stochasticity as “the final cherry on top” only after you are confident the deterministic samples look right. As a calibration point, the EDM CIFAR-10 model benefits negligibly from churn at its trained quality (the network is good enough that any exploration injects more error than it fixes), while their ImageNet-64 model still benefits, a useful empirical pointer that the marginal value of churn diminishes as the score network gets better.

EDM in the wild: the RFDiffusion3 inference sampler

To make the relevance of all this concrete, let us look at how the state-of-the-art protein-structure generator RFDiffusion3 implements its sampler. The relevant code lives in inference_sampler.py in the RosettaCommons foundry repository, in a method tellingly named sample_diffusion_like_af3. Stripping out the protein-specific bookkeeping (motif handling, classifier-free guidance, recycling, logging), the inner loop is essentially this:

for step_num, (c_t_minus_1, c_t) in enumerate(zip(noise_schedule, noise_schedule[1:])):

    # (1) Stochastic churn: inflate the current noise level by (1 + gamma).
    gamma = self.gamma_0 if c_t > self.gamma_min else 0
    t_hat = c_t_minus_1 * (gamma + 1)

    # (2) Inject noise to match the inflated level: epsilon ~ sqrt(t_hat^2 - t^2) * z.
    epsilon_L = (
        self.noise_scale
        * torch.sqrt(torch.square(t_hat) - torch.square(c_t_minus_1))
        * torch.normal(mean=0.0, std=1.0, size=X_L.shape, device=X_L.device)
    )
    X_noisy_L = X_L + epsilon_L

    # (3) Call the preconditioned denoiser D_theta(x; t_hat) on the inflated state.
    X_denoised_L = diffusion_module(X_noisy_L=X_noisy_L, t=t_hat.tile(D), ...)

    # (4) EDM tangent direction: (x - D(x; t_hat)) / t_hat.
    delta_L = (X_noisy_L - X_denoised_L) / t_hat   # "gradient of x wrt. t at x_t_hat"
    d_t = c_t - t_hat

    # (5) First-order Euler step along the EDM ODE.
    X_L = X_noisy_L + step_scale * d_t * delta_L

Each of these five lines is a line-for-line transcription of an equation we have just derived.

  • Lines (1) and (2) are EDM’s stochastic churn: t^i=ti(1+γi)\hat{t}_i = t_i(1+\gamma_i) followed by x^i=xi+t^i2ti2Snoiseϵ\hat{\mathbf{x}}_i = \mathbf{x}_i + \sqrt{\hat{t}_i^2 - t_i^2}\cdot S_{\text{noise}}\cdot\boldsymbol{\epsilon}. The thresholding γ=γ0\gamma = \gamma_0 if ct>γminc_t > \gamma_\text{min} else 00 is the EDM paper’s S_min/S_max window (implemented here with only the lower cutoff), deciding where in the schedule to spend the churn budget.
  • Line (3) is the call to the preconditioned denoiser Dθ(x;t^)D_\theta(\mathbf{x}; \hat{t}). The diffusion_module here wraps the inner network with the analytic skip-connection and input/output scalings we derived above.
  • Line (4) is literally the EDM ODE: dx/dt=(xD(x;t))/t\mathrm{d}\mathbf{x}/\mathrm{d}t = (\mathbf{x} – D(\mathbf{x}; t))/t. The variable is even named delta_L and the comment in the source code reads “gradient of x wrt. t at x_t_hat”.
  • Line (5) is the Euler step xi+1=x^i+(ti+1t^i)di\mathbf{x}_{i+1} = \hat{\mathbf{x}}_i + (t_{i+1} - \hat{t}_i)\,\mathbf{d}_i, with an extra step_scale knob that lets the user dial the step magnitude up or down (useful for trading sample diversity against fidelity in protein design).

Two small caveats worth flagging. First, RFDiffusion3 uses a first-order Euler integrator rather than the second-order Heun method recommended in the EDM paper; in practice the protein community has found that the extra denoiser evaluation per step is not worth its cost when the network is already expensive. Second, the step_scale and noise_scale multipliers are protein-design-specific knobs grafted on top of the pure EDM recipe to control downstream metrics like RMSD-to-motif and sample diversity, the kind of practical adjustment one always finds when transplanting a recipe from images to structures.

The point is that essentially every line of the inner sampling loop of a modern de-novo protein-design model is exactly the Karras et al. EDM sampler from Section 4, with the same churn, the same (xD)/t(\mathbf{x} - D)/t tangent direction, and the same preconditioned denoiser at its heart. Everything we have built up in this section is not retrospective taxonomy: it is the code path running every time someone designs a new protein with RFDiffusion3.

5. The Variance Schedule, Up Close

Because every method we have discussed is, mathematically, a particular Gaussian probability path xt=αtx0+σtϵ\mathbf{x}_t = \alpha_t\,\mathbf{x}_0 + \sigma_t\,\boldsymbol{\epsilon}, it is worth pausing to lay out side by side the choices of (αt,σt)(\alpha_t, \sigma_t) that constitute the variance schedule in each framework.

Framework | αt\alpha_t | σt\sigma_t | αt2+σt2\alpha_t^2 + \sigma_t^2 (when Var(x0)=1\mathrm{Var}(\mathbf{x}_0) = 1) | Boundary at tTt \to T
DDPM / VP, linear | αˉt\sqrt{\bar{\alpha}_t}, βt[104,2 ⁣ ⁣102]\beta_t\in[10^{-4},2\!\cdot\!10^{-2}] linear | 1αˉt\sqrt{1-\bar{\alpha}_t} | =1= 1 (preserved) | N(0,I)\to\mathcal{N}(\mathbf{0},\mathbf{I})
DDPM / VP, cosine | αˉt=cos ⁣(t/T+s1+sπ2) ⁣/ ⁣cos ⁣(s1+sπ2)\sqrt{\bar{\alpha}_t} = \cos\!\bigl(\tfrac{t/T+s}{1+s}\cdot\tfrac{\pi}{2}\bigr)\!/\!\cos\!\bigl(\tfrac{s}{1+s}\cdot\tfrac{\pi}{2}\bigr) | 1αˉt\sqrt{1-\bar{\alpha}_t} | =1= 1 (preserved) | N(0,I)\to\mathcal{N}(\mathbf{0},\mathbf{I})
NCSN / VE | 11 | σ(t)\sigma(t), geometric or polynomial | 1+σt21 + \sigma_t^2 (exploding) | N(x0,σmax2I)\to\mathcal{N}(\mathbf{x}_0,\sigma_{\max}^2\mathbf{I})
sub-VP | αˉt\sqrt{\bar{\alpha}_t} | 1αˉt1-\bar{\alpha}_t (note: no square root) | <1< 1 at intermediate tt | N(0,I)\to\mathcal{N}(\mathbf{0},\mathbf{I})
EDM | 11 | σ[0.002,80]\sigma\in[0.002,80] via ρ=7\rho=7 power law | 1+σ21 + \sigma^2 (VE-like) | truncated at σmax\sigma_{\max}, not \infty

A few comments worth internalising. Linear vs cosine is a small but consequential change: Nichol & Dhariwal (2021) noted that the linear schedule destroys the signal far too aggressively in the early-to-middle timesteps for low-resolution images, leaving the network nothing to learn during the late part of training; the cosine schedule keeps αˉt\bar{\alpha}_t near 1 for longer and then collapses smoothly. VP vs VE is a more fundamental geometric choice: under VP the trajectory stays inside a unit ball, under VE it expands outward. EDM is essentially VE plus principled preconditioning, which restores numerical stability and lets the network see well-scaled inputs at every σ\sigma. Finally, sub-VP sacrifices a small amount of sample quality for a sharper concentration around x0\mathbf{x}_0 at intermediate tt, improving likelihood.

6. Connections, Tradeoffs, and Where the Field Sits

It is now possible to draw the picture cleanly. Every method we have discussed corresponds to a choice of three things:

  1. A probability path {pt}\{p_t\} interpolating noise and data, defined by (αt,σt)(\alpha_t, \sigma_t). VP, VE, sub-VP, and EDM-VE are different paths.
  2. A regression target (ϵ\boldsymbol{\epsilon}, x0\mathbf{x}_0, or score logpt\nabla\log p_t), all of which are related by exact closed-form linear transformations once the path and prior are fixed.
  3. A sampler, either an SDE solver (DDPM ancestral, reverse-SDE Euler–Maruyama, EDM-stochastic), a deterministic ODE solver (DDIM, Heun, RK4), or a hybrid (predictor–corrector, EDM-churn).

The historical narrative is essentially that successive papers separated these three choices and showed that mixing them productively gives better samples. DDPM coupled VP, ϵ\epsilon-prediction, and ancestral SDE sampling. DDIM kept VP and ϵ\epsilon but swapped the sampler for an ODE. Song et al. (2021b) showed all of these are discretisations of a continuous SDE and that the same model admits both stochastic and deterministic samplers, and introduced VE and sub-VP. EDM then showed that with the right preconditioning and ODE solver, the difference between VP and VE essentially vanishes and most empirical gains come from sampler quality and noise-level distribution at training.

The stochastic vs deterministic tradeoff is worth spelling out. Stochastic samplers (DDPM, reverse-SDE, EDM with churn) act as their own error correctors: noise injection at intermediate steps washes out accumulated drift error from the score network, often producing better FID at high NFE than the corresponding deterministic ODE solver on the same model. Deterministic samplers (DDIM, Heun on the PF-ODE) are faster per quality at low NFE, are invertible (useful for editing, latent interpolation, likelihood evaluation), and are reproducible. Karras et al. (2022) systematically characterise this crossover; in my own protein-design work I tend to use the deterministic Heun sampler for NFE budgets under ~30 and switch to stochastic-churn variants when I have NFE to spare or am worried about mode coverage.

Comparison of sampling methods

Method | Year | Key idea | Sampling type | Pros | When to use
DDPM (Ho et al.) | 2020 | Reverse Markov chain, ϵ\epsilon-prediction, linear β\beta | Stochastic (ancestral) | Simple, foundational, robust training | Reference implementations; teaching
DDIM (Song et al.) | 2021 | Non-Markovian forward, η=0\eta=0 deterministic | Deterministic (or interpolated) | 10 to 50 times fewer steps than DDPM, invertible, latent interpolation | Fast inference with an existing DDPM-trained model
NCSN / VE-SDE (Song & Ermon; Song et al.) | 2019 / 2021 | Score matching at multiple noise scales; Langevin sampling | Stochastic (Langevin / SDE) | Strong likelihoods; PF-ODE for exact density | High-noise regimes; likelihood-critical applications
VP-SDE / sub-VP (Song et al.) | 2021 | Continuous-time DDPM via SDE; PC sampler; PF-ODE | Both | Unified framework; principled likelihood; PC samplers | When you want the SDE formalism (e.g. inverse problems, guidance)
EDM (Karras et al.) | 2022 | σ\sigma-parameterisation, preconditioning, Heun + churn | Both | Best FID per NFE; clean code; modular | Modern default for image / structure diffusion; my default in protein work

The current state of the art in protein-structure generation is, at the time of writing in May 2026, a mixture of these ingredients. Backbone-generation models like RFDiffusion descend directly from the score-SDE / EDM lineage. The point I would press on a student starting the field now is that the training objectives across all of these are nearly identical regression losses on linear reparameterisations of the same quantity (noise, clean data, or score), and what really differs between methods is the path, the sampler, and the symmetry group baked into the network. Pick the path that gives the geometry you want; pick the sampler that matches your NFE budget; train the same kind of MSE in every case.

Where to look next

The story above stops short of an important and very active branch of the field: flow matching and its relatives (rectified flow, OT-CFM). These methods drop the diffusion construction entirely and learn the velocity field of an ODE that transports noise to data along arbitrary, often straight-line, probability paths. They have become dominant in protein-structure generation since roughly 2023 and deserve their own dedicated treatment. For a broader and excellent overview of this family I recommend the MIT 6.S979 lecture notes on diffusion and flow-based generative models, which cover flow matching, conditional flow matching, and the connections back to the score-based picture we have built up here.

References

  1. Anderson, B. D. O. (1982). Reverse-time diffusion equation models. Stochastic Processes and Their Applications 12(3), 313–326.
  2. Feller, W. (1949). On the theory of stochastic processes, with particular reference to applications. Proceedings of the (First) Berkeley Symposium on Mathematical Statistics and Probability, 403–432.
  3. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239.
  4. Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, 695–709.
  5. Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS 2022. arXiv:2206.00364.
  6. MIT 6.S979 (2026). Lecture notes on diffusion and flow-based generative models.
  7. Nichol, A., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021. arXiv:2102.09672.
  8. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. arXiv:1503.03585.
  9. Song, J., Meng, C., & Ermon, S. (2021a). Denoising Diffusion Implicit Models. ICLR 2021. arXiv:2010.02502.
  10. Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arXiv:1907.05600.
  11. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021b). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arXiv:2011.13456.
  12. Vincent, P. (2011). A Connection Between Score Matching and Denoising Autoencoders. Neural Computation 23(7), 1661–1674.
  13. Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., et al. (2023). De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100.
