A timeline of sampling methods of diffusion models

When approaching the methods used in de-novo protein design, one is quickly confronted with a plethora of overlapping formulations of what looks superficially like “the same thing”. One paper trains an $\boldsymbol{\epsilon}$-prediction network with a simple MSE loss; another trains a score network with a stochastic-differential-equation justification; a third trains a clean-data predictor under yet another schedule. Each formulation carries its own notation, its own variance schedule, and its own sampler. Qualitatively, this zoo of formulations is doing the same thing: it starts from some unstructured noise and iteratively refines it to eventually produce a protein structure similar (but different!) to other proteins we have experimentally determined in the past. What is not immediately obvious to a newcomer is that all of these formulations are historical descendants of a small number of foundational ideas, and that essentially every architectural and algorithmic decision in a modern protein-design diffusion model has a specific paper of origin and a specific motivation for being there.

This post is my attempt to put these formulations onto a single timeline. I trace the trajectory of the field through four foundational works: DDPM (Ho et al., 2020), DDIM (Song et al., 2021a), the score-based SDE unification (Song et al., 2021b), and EDM (Karras et al., 2022), explaining at each step what specific problem with the previous formulation the next paper was attacking and how the new formulation generalises or simplifies the old one. The goal is coherent motivation rather than exhaustive coverage; the reader interested in implementation details is referred to the original papers and the references at the end.

The problem we want to solve

Before diving into any specific method, let us be precise about the problem all of them are trying to solve. As mentioned above, we have a target distribution $p_{\text{data}}$ on $\mathbb{R}^d$ we are interested in sampling from. For our purposes, think of the distribution over plausible protein backbones, side-chain configurations, or full structures. We cannot evaluate $p_{\text{data}}$ analytically and we cannot sample from it directly; what we have is a finite dataset of samples $\{\mathbf{x}^{(i)}\}_{i=1}^{N} \sim p_{\text{data}}$. We want a generative model that produces fresh samples from $p_{\text{data}}$ at inference time.

The strategy that diffusion and score-based models all share is the same: introduce a tractable prior $p_{\text{prior}}$ (typically an isotropic Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$) that we can sample from trivially, and learn a map that transforms samples from $p_{\text{prior}}$ into samples from $p_{\text{data}}$. The methods differ in how they construct this map: as the reverse of a noising Markov chain (DDPM), or as the time-integral of a reverse-time stochastic differential equation (score SDE), with the deterministic probability-flow ODE as a natural sibling. But the goal is identical, and the underlying object (a parameterised, time-dependent transport from a tractable prior to an intractable data distribution) is identical across all of them.

A useful equation to keep in mind throughout is the Gaussian probability path

$$\mathbf{x}_t \;=\; \alpha_t\,\mathbf{x}_0 \;+\; \sigma_t\,\boldsymbol{\epsilon},\qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),$$

shared by all the methods we will discuss. The differences between DDPM, DDIM, VP/VE-SDE, and EDM can largely be read off as different choices of $(\alpha_t, \sigma_t)$ and different ways of turning a learnt regression target back into a sample.
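To make the $(\alpha_t, \sigma_t)$ bookkeeping concrete, here is a minimal sketch (pure Python; the helper names are my own, not from any library) of the two coefficient choices that recur below: the variance-preserving path used by DDPM and the variance-exploding path used by score-based models.

```python
import math

# Hypothetical helpers illustrating the shared Gaussian path
# x_t = alpha_t * x0 + sigma_t * eps for two coefficient choices.

def vp_coefficients(abar_t):
    """Variance-preserving (DDPM): alpha_t = sqrt(abar_t), sigma_t = sqrt(1 - abar_t)."""
    return math.sqrt(abar_t), math.sqrt(1.0 - abar_t)

def ve_coefficients(sigma_t):
    """Variance-exploding (VE-SDE): alpha_t = 1, sigma_t free to grow."""
    return 1.0, sigma_t

def perturb(x0, eps, alpha_t, sigma_t):
    """One realisation of the Gaussian probability path (scalar toy)."""
    return alpha_t * x0 + sigma_t * eps
```

Under the VP choice, $\alpha_t^2 + \sigma_t^2 = 1$ at every $t$, which is exactly the “total variance stays at unity” property of the DDPM path.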

1. DDPM: Denoising as a Hierarchical VAE

The modern diffusion era begins with Ho, Jain & Abbeel’s “Denoising Diffusion Probabilistic Models” ([Ho et al., 2020](#ref-ho2020)), which built on the variational construction of Sohl-Dickstein et al. (2015) and simplified it into a recipe that could be trained reliably at the scale of CIFAR-10 and LSUN. The contribution is conceptually two-part: a clever choice of variational objective that turns generative modelling into a denoising problem, and an empirical finding that a simplified version of that objective trains better than the principled one.

The forward (noising) process

DDPM defines a fixed, parameter-free Markov chain that gradually corrupts a data point $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ into Gaussian noise over $T$ discrete steps. With a variance schedule $\{\beta_t\}_{t=1}^{T}\subset(0,1)$,

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \;=\; \mathcal{N}\!\bigl(\mathbf{x}_t;\sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\,\beta_t\mathbf{I}\bigr).$$

In the standard DDPM the schedule is linear: $\beta_t$ grows linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 2\times 10^{-2}$ over $T = 1000$ steps (Ho et al., 2020).
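As a concrete sketch (assuming nothing beyond the numbers just quoted), the linear schedule and the running product of the per-step “signal survival” factors $1-\beta_t$ can be tabulated in a few lines; that product ends up vanishingly small at $t = T$:

```python
# Toy tabulation of the linear DDPM schedule from the text:
# beta_1 = 1e-4 grows linearly to beta_T = 2e-2 over T = 1000 steps.
T = 1000
beta = [1e-4 + (2e-2 - 1e-4) * t / (T - 1) for t in range(T)]

# Running product of the per-step signal-survival factors (1 - beta_t).
abar = []
prod = 1.0
for b in beta:
    prod *= 1.0 - b
    abar.append(prod)
```

The final entry is of order $10^{-5}$: after 1000 steps essentially no signal survives, which is the forgetting property described above.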

It is worth pausing to understand what this conditional is doing. The new sample $\mathbf{x}_t$ is drawn from a Gaussian centred at $\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}$ with variance $\beta_t \mathbf{I}$. Two observations are crucial here:


  1. The mean of $\mathbf{x}_t$ equals the mean of $\mathbf{x}_{t-1}$ multiplied by $\sqrt{1-\beta_t} < 1$. Iterating this forward, the mean of $\mathbf{x}_t$ conditional on $\mathbf{x}_0$ is $\prod_{s\le t} \sqrt{1-\beta_s}\cdot\mathbf{x}_0$, a product of factors strictly less than one. Such a product converges to zero as $t\to T$. In words: the forward process gradually forgets the starting point and pushes the conditional mean toward the origin.



  2. The variance accumulates from zero towards one. A short recursive calculation (below) shows that if $\mathrm{Var}(\mathbf{x}_0) \le 1$, the conditional variance of $\mathbf{x}_t \mid \mathbf{x}_0$ grows monotonically from $0$ at $t=0$ to a value close to $1$ at $t = T$.


Putting these together: regardless of where $\mathbf{x}_0$ lives, $q(\mathbf{x}_T \mid \mathbf{x}_0)$ is approximately $\mathcal{N}(\mathbf{0}, \mathbf{I})$ for any reasonable $\beta$ schedule. This is the property that makes the whole construction work.

The crucial algebraic fact about Gaussian Markov chains, and what makes diffusion tractable in the first place, is that the marginal at any step $t$ conditioned on $\mathbf{x}_0$ admits a closed form. Setting $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$,

$$q(\mathbf{x}_t \mid \mathbf{x}_0) \;=\; \mathcal{N}\!\bigl(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\,(1-\bar{\alpha}_t)\mathbf{I}\bigr) \;\;\Longleftrightarrow\;\; \mathbf{x}_t \;=\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 \;+\; \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\quad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).$$

This is the reparameterisation: rather than simulating $t$ forward steps, we can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ in one go, using a single Gaussian noise draw and a precomputed $\bar{\alpha}_t$.

Intuition: The forward process gradually destroys information by adding noise. The signal $\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0$ is rescaled down as $t$ grows, while the noise $\sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ is rescaled up; together they keep the total variance at unity (when $\mathrm{Var}(\mathbf{x}_0) = 1$). This is what we will later call a variance-preserving path.
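A short numerical check of this one-shot property (a sketch with a toy scalar “data point”; the schedule values are the linear ones quoted earlier): sampling $\mathbf{x}_t$ directly from the closed-form marginal at $t = T$ should give draws that look like standard Gaussian noise, whatever $\mathbf{x}_0$ was.

```python
import math
import random

# Linear schedule and its cumulative product, as in the text.
T = 1000
beta = [1e-4 + (2e-2 - 1e-4) * t / (T - 1) for t in range(T)]
abar = []
prod = 1.0
for b in beta:
    prod *= 1.0 - b
    abar.append(prod)

def q_sample(x0, t, rng):
    """One-shot draw from q(x_t | x_0) via the reparameterisation."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps

# At t = T-1 the draws should be near-standard Gaussian even for x0 = 5.
rng = random.Random(0)
draws = [q_sample(5.0, T - 1, rng) for _ in range(20_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
```

With the surviving signal fraction $\sqrt{\bar{\alpha}_T}\,\mathbf{x}_0$ below $0.04$, the empirical mean sits near zero and the empirical variance near one.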

It is worth pausing on how nontrivial this is. No matter where you start (a protein structure, an image of a cat, a noisy sketch), applying small Gaussian perturbations enough times always lands you at the same easy-to-sample distribution: an isotropic Gaussian. The forward direction is universal. The generative problem reduces to inverting this “universal contraction”: if we can learn to undo the corruption step-by-step, we can sample new data points by starting at the Gaussian prior and reversing the chain.

A latent-variable view, and why we need a variational bound

Suppose we want to use the reverse direction to draw samples. We define a generative model $p_\theta$ that parameterises a reverse Markov chain,

$$p_\theta(\mathbf{x}_{0:T}) \;=\; p(\mathbf{x}_T)\,\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t),$$

where the prior $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0},\mathbf{I})$ matches the limiting marginal of the forward process, and each reverse transition is a Gaussian with parameters $(\boldsymbol{\mu}_\theta(\mathbf{x}_t,t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t,t))$ predicted by a neural network. To train, we want to maximise the likelihood $p_\theta(\mathbf{x}_0)$ of the observed data. Why work with the joint and not just $p_\theta(\mathbf{x}_0)$ directly? Because we cannot evaluate $p_\theta(\mathbf{x}_0)$ in closed form: it is the marginal of the joint over all intermediate states:

$$p_\theta(\mathbf{x}_0) \;=\; \int p_\theta(\mathbf{x}_{0:T})\,\mathrm{d}\mathbf{x}_{1:T}.$$

The integral on the right is over $T$ high-dimensional latent variables and is intractable. This is the standard situation in latent-variable models, and the standard fix, due to the VAE literature, is to lower-bound the log-likelihood with the evidence lower bound (ELBO). We introduce a variational distribution $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ over the latents, multiply and divide by it inside the integral, and apply Jensen’s inequality to the resulting log of an expectation:

$$\log p_\theta(\mathbf{x}_0) \;=\; \log \int q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)\,\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\,\mathrm{d}\mathbf{x}_{1:T} \;\ge\; \mathbb{E}_{q}\!\left[\log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right].$$

The expectation on the right is the evidence lower bound. Its negative is the variational lower bound loss,

$$\mathcal{L}_{\text{VLB}} \;=\; -\,\mathbb{E}_{q}\!\left[\log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right],$$

which we minimise to push the model joint toward the data joint induced by $q$.

What makes diffusion special compared to a generic VAE is that we fix the variational distribution $q$ to be the known forward process, not learnt, not parameterised, just the simple Gaussian chain we constructed above. This is the conceptual key: a diffusion model is a hierarchical VAE with $T$ latents whose encoder is frozen by design and whose decoder is a learnable reverse Markov chain.

The diagram above summarises the construction we have just walked through: a frozen forward Markov chain qq that destroys structure, and a learnable reverse chain pθp_\theta that reconstructs it.

Decomposing $\mathcal{L}_{\text{VLB}}$

The integral defining $\mathcal{L}_{\text{VLB}}$ is still high-dimensional, but it admits a clean per-step decomposition. Start by expanding the ratio inside the log, using the factorisations of $p_\theta(\mathbf{x}_{0:T})$ and $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$:

$$\log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)} \;=\; \log p(\mathbf{x}_T) + \sum_{t=1}^{T}\log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_t\mid\mathbf{x}_{t-1})}.$$

The forward conditionals $q(\mathbf{x}_t\mid\mathbf{x}_{t-1})$ are not directly comparable to the reverse conditionals $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$: they go in opposite directions. To make them comparable, rewrite $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ for $t \ge 2$ using Bayes’ rule, conditioning everything on $\mathbf{x}_0$:

$$q(\mathbf{x}_t\mid\mathbf{x}_{t-1}) \;=\; q(\mathbf{x}_t\mid\mathbf{x}_{t-1},\mathbf{x}_0) \;=\; \frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,q(\mathbf{x}_t\mid\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}.$$

The first equality uses the Markov property of the forward chain; the second is Bayes. Substituting back and using a telescoping argument on the marginals (full derivation in [Ho et al., 2020, appendix A](#ref-ho2020)), one obtains

$$\mathcal{L}_{\text{VLB}} \;=\; \underbrace{D_{\mathrm{KL}}\!\bigl(q(\mathbf{x}_T\mid\mathbf{x}_0)\,\Vert\,p(\mathbf{x}_T)\bigr)}_{L_T} \;+\; \sum_{t=2}^{T}\underbrace{\mathbb{E}_{q}\!\left[D_{\mathrm{KL}}\!\bigl(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,\Vert\,p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\bigr)\right]}_{L_{t-1}} \;\underbrace{-\,\mathbb{E}_{q}\!\left[\log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)\right]}_{L_0}.$$

This decomposition has a beautiful structure. $L_T$ is essentially constant: it measures how close the forward marginal $q(\mathbf{x}_T\mid\mathbf{x}_0)$ is to the prior $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0},\mathbf{I})$, and is approximately zero by construction. $L_0$ is a final-step log-likelihood that becomes negligible at high $T$. The interesting terms are the per-step KL divergences $L_{t-1}$, which compare the true posterior $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$, a Gaussian with closed form thanks to Bayes’ rule on Gaussians, against the model’s reverse conditional $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$.

The true posterior is

$$q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) \;=\; \mathcal{N}\bigl(\mathbf{x}_{t-1};\,\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0),\,\tilde{\beta}_t\mathbf{I}\bigr),$$

with closed-form posterior mean and variance

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0) \;=\; \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 \;+\; \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t,\qquad \tilde{\beta}_t \;=\; \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t.$$

Here we can notice how the new Gaussian mean is an interpolation between the noisy state $\mathbf{x}_t$ in which we currently sit and the clean, unnoised sample at time zero!
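In code, the posterior parameters are a direct transcription of these formulas (a sketch on a scalar toy; `beta` and `abar` are schedule lists as tabulated earlier, with $\bar{\alpha}_{t-1}$ read as 1 at the first step):

```python
import math

def posterior_mean_var(x_t, x0, t, beta, abar):
    """Closed-form q(x_{t-1} | x_t, x_0): returns (mu_tilde, beta_tilde)."""
    abar_prev = abar[t - 1] if t > 0 else 1.0
    coef_x0 = math.sqrt(abar_prev) * beta[t] / (1.0 - abar[t])
    coef_xt = math.sqrt(1.0 - beta[t]) * (1.0 - abar_prev) / (1.0 - abar[t])
    mu = coef_x0 * x0 + coef_xt * x_t            # interpolation of x0 and x_t
    var = (1.0 - abar_prev) / (1.0 - abar[t]) * beta[t]
    return mu, var

# Sanity demo on the linear schedule from the text.
beta = [1e-4 + (2e-2 - 1e-4) * t / 999 for t in range(1000)]
abar = []
_p = 1.0
for _b in beta:
    _p *= 1.0 - _b
    abar.append(_p)
mu_tilde, beta_tilde = posterior_mean_var(1.0, 1.0, 500, beta, abar)
```

When `x_t` and `x0` agree, the interpolated mean is (very nearly) that common value, and the posterior variance is always smaller than $\beta_t$: conditioning on $\mathbf{x}_0$ shrinks the reverse-step uncertainty.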

From KL to MSE on means

A key fact, originally due to Feller (1949), is that when the forward step $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ is Gaussian with sufficiently small variance $\beta_t$, the exact reverse step $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ is also approximately Gaussian. This is the formal justification for parameterising the model’s reverse conditional as a Gaussian with fixed covariance: $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1};\,\boldsymbol{\mu}_\theta(\mathbf{x}_t,t),\,\sigma_t^2\mathbf{I})$ with $\sigma_t^2 \in \{\beta_t, \tilde{\beta}_t\}$. The variance is fixed (not learned) to one of the two endpoints corresponding to the upper and lower bounds on the true reverse variance derived in Ho et al. (2020). The KL divergence between two Gaussians with the same covariance $\sigma_t^2\mathbf{I}$ has a particularly simple form: it reduces to a scaled squared distance between the means:

$$D_{\mathrm{KL}}\!\bigl(\mathcal{N}(\tilde{\boldsymbol{\mu}}_t,\sigma_t^2\mathbf{I})\,\Vert\,\mathcal{N}(\boldsymbol{\mu}_\theta,\sigma_t^2\mathbf{I})\bigr) \;=\; \frac{1}{2\sigma_t^2}\,\lVert\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta\rVert^2.$$

So the entire per-step loss reduces to an MSE between the model’s predicted mean and the true posterior mean. The strategic question is then: how should we parameterise the network’s output to make this MSE easy to learn?

Using the reparameterisation $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$, that is, expressing $\mathbf{x}_0$ in terms of the noise $\boldsymbol{\epsilon}$ that was used to construct $\mathbf{x}_t$, and substituting into $\tilde{\boldsymbol{\mu}}_t$, mechanical algebra gives

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\boldsymbol{\epsilon}) \;=\; \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}\right).$$

This suggests parameterising the model’s mean by the same expression but with a learnt $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ in place of the true $\boldsymbol{\epsilon}$:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t,t) \;=\; \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right).$$

Substituting $\tilde{\boldsymbol{\mu}}_t$ and $\boldsymbol{\mu}_\theta$ into the KL expression and simplifying gives, after cancellation,

$$L_{t-1} \;=\; \frac{\beta_t^2}{2\,\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\mathbb{E}_{\mathbf{x}_0,\boldsymbol{\epsilon}}\!\left[\lVert\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\rVert^2\right].$$

Each per-step VLB term has thus reduced to a weighted MSE between the true noise (drawn when we constructed $\mathbf{x}_t$) and the network’s predicted noise. This is, structurally, vanilla supervised regression: predict $\boldsymbol{\epsilon}$ given $\mathbf{x}_t$.

Ho et al. then made the empirical observation that dropping the time-dependent weight in front of this MSE produces strictly better samples than keeping the principled VLB weight. The resulting simple objective is the workhorse of modern diffusion:

$$\mathcal{L}_{\text{simple}}(\theta) \;=\; \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}}\bigl[\,\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\rVert^2\,\bigr],\qquad \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}.$$

The dropped weighting is largest for small $t$ (mild noise), so removing it effectively upweights the harder high-noise levels, which empirically gives sharper samples (Ho et al., 2020).

This objective is extraordinarily efficient and scalable, and this is, in my view, the single most important reason diffusion models took off so quickly. Look at what training requires at each step: pick a data point $\mathbf{x}_0$ from the training set, sample a timestep $t$ uniformly, sample a single Gaussian noise vector $\boldsymbol{\epsilon}$, form $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ using precomputed $\bar{\alpha}_t$ tables, and regress with one network forward-backward pass. There is no need to simulate the trajectory. There is no MCMC inner loop, no ODE solver in training, no need to even store any state between iterations. The training cost per step is identical to that of a vanilla supervised regressor. This is the property that made diffusion scale from CIFAR-10 to ImageNet to text-conditional generation to protein design without much architectural rethinking.
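The whole training step fits in a few lines. The sketch below uses a scalar toy and a hypothetical `eps_theta` callable standing in for the network (here, a stand-in that always predicts zero); everything else is exactly the recipe just described.

```python
import math
import random

def training_loss_step(x0, eps_theta, beta, abar, rng):
    """One L_simple training step: sample t, sample eps, noise, regress."""
    t = rng.randrange(1, len(abar))              # timestep ~ Uniform
    eps = rng.gauss(0.0, 1.0)                    # single noise draw
    x_t = math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps
    pred = eps_theta(x_t, t)                     # one network forward pass
    return (eps - pred) ** 2                     # unweighted MSE

# Schedule tables, precomputed once (linear schedule from the text).
beta = [1e-4 + (2e-2 - 1e-4) * t / 999 for t in range(1000)]
abar = []
_p = 1.0
for _b in beta:
    _p *= 1.0 - _b
    abar.append(_p)

# Demo: a trivial stand-in predictor that always outputs 0, whose
# expected loss is E[eps^2] = 1 by construction.
rng = random.Random(0)
losses = [training_loss_step(1.0, lambda x, t: 0.0, beta, abar, rng)
          for _ in range(5000)]
```

In a real model `eps_theta` would be a neural network and a backward pass would follow; note that nothing in the loop simulates the forward chain step by step.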

Ancestral sampling

At inference time we want new samples from the trained model, so we must run the Markov chain backward. Start from $\mathbf{x}_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and for $t = T, T-1, \dots, 1$ apply the ancestral sampler:

$$\mathbf{x}_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right) \;+\; \sigma_t\,\mathbf{z},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\;\;\sigma_t^2\in\{\beta_t,\tilde{\beta}_t\}.$$

We can observe here as well how the new, less noisy sample at time $t-1$ is indeed our sample at time $t$ minus the noise that the network recognises in the sample. Heuristically, we are taking a step towards the denoised sample (away from the noise).
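As a sketch of the full loop (scalar toy, stdlib only): with an oracle `eps_theta` that knows the data is concentrated at a single point, ancestral sampling should land very close to that point. The oracle is an assumption for illustration only, not anything a trained network gives you.

```python
import math
import random

def ancestral_sample(eps_theta, beta, abar, rng):
    """Run the reverse chain x_T -> x_0 with the sigma_t^2 = beta_t choice."""
    x = rng.gauss(0.0, 1.0)                      # x_T ~ N(0, 1)
    for t in range(len(beta) - 1, 0, -1):
        mean = (x - beta[t] / math.sqrt(1.0 - abar[t]) * eps_theta(x, t)) \
               / math.sqrt(1.0 - beta[t])        # sqrt(alpha_t) = sqrt(1 - beta_t)
        x = mean + math.sqrt(beta[t]) * rng.gauss(0.0, 1.0)
    return x

# Linear schedule from the text.
beta = [1e-4 + (2e-2 - 1e-4) * t / 999 for t in range(1000)]
abar = []
_p = 1.0
for _b in beta:
    _p *= 1.0 - _b
    abar.append(_p)

# Oracle for "data" concentrated at x0 = 5: the exact eps implied by x_t.
def oracle_eps(x_t, t):
    return (x_t - math.sqrt(abar[t]) * 5.0) / math.sqrt(1.0 - abar[t])

rng = random.Random(1)
sample = ancestral_sample(oracle_eps, beta, abar, rng)
```

The chain starts at pure noise and, step by step, contracts onto the point where the (oracle) denoiser says the data lives.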

A later refinement worth flagging here: [Nichol & Dhariwal (2021)](#ref-nichol2021) showed that a cosine schedule,

$$\bar{\alpha}_t \;=\; \frac{f(t)}{f(0)},\qquad f(t) \;=\; \cos^2\!\Bigl(\tfrac{t/T+s}{1+s}\cdot\tfrac{\pi}{2}\Bigr),\qquad s = 0.008,$$

avoids the pathology of the linear schedule destroying low-resolution signal too quickly; $\bar{\alpha}_t$ stays near 1 for longer in the early steps and then collapses smoothly. They also showed that learning a log-interpolation between $\beta_t$ and $\tilde{\beta}_t$ for the reverse covariance improves log-likelihood. Both choices are now standard.
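For reference, the cosine schedule is a one-liner (a sketch; $t$ runs over $0,\dots,T$):

```python
import math

def abar_cosine(t, T, s=0.008):
    """Cosine schedule of Nichol & Dhariwal: abar_t = f(t) / f(0)."""
    def f(u):
        return math.cos((u / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)
```

At $t = 50$ of 1000 the value is still about 0.99, and it decays monotonically to essentially 0 at $t = T$, unlike the linear schedule, which collapses the signal much earlier.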

DDPM in the wild: the original RFDiffusion sampler

The original RFDiffusion (Watson et al., 2023), one of the models that launched de-novo protein design as we know it today, is in this DDPM lineage. Its inner sampling loop in rfdiffusion/inference/utils.py implements exactly the ancestral update we have just derived. The key helper, get_mu_xt_x0, computes the posterior mean $\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)$ and posterior variance $\tilde{\beta}_t$ from cached $\beta$ and $\bar{\alpha}$ schedules:

```python
def get_mu_xt_x0(xt, px0, t, beta_schedule, alphabar_schedule, eps=1e-6):
    """Given xt, predicted x0 and the timestep t, give mu of x(t-1)."""
    t_idx = t - 1
    # (1) Posterior variance: sigma = ((1 - alphabar_{t-1}) / (1 - alphabar_t)) * beta_t
    #     This is exactly tilde{beta}_t.
    sigma = (
        (1 - alphabar_schedule[t_idx - 1]) / (1 - alphabar_schedule[t_idx])
    ) * beta_schedule[t_idx]
    xt_ca  = xt[:, 1, :]    # C-alpha coordinates at time t
    px0_ca = px0[:, 1, :]   # network's prediction of clean C-alpha coordinates
    # (2) First term of the posterior mean: coefficient on x_0.
    a = (
        (torch.sqrt(alphabar_schedule[t_idx - 1] + eps) * beta_schedule[t_idx])
        / (1 - alphabar_schedule[t_idx])
    ) * px0_ca
    # (3) Second term of the posterior mean: coefficient on x_t,
    #     with sqrt(1 - beta_t) = sqrt(alpha_t).
    b = (
        (torch.sqrt(1 - beta_schedule[t_idx] + eps) * (1 - alphabar_schedule[t_idx - 1]))
        / (1 - alphabar_schedule[t_idx])
    ) * xt_ca
    mu = a + b
    return mu, sigma
```

And the helper that drives one ancestral step, get_next_ca, samples $\mathbf{x}_{t-1}$ from the Gaussian $\mathcal{N}(\tilde{\boldsymbol{\mu}}_t, \tilde{\beta}_t\mathbf{I})$ in one line:

```python
def get_next_ca(xt, px0, t, ..., noise_scale=1.0):
    # Compute posterior mean and variance from the cached schedules.
    mu, sigma = get_mu_xt_x0(xt, px0, t, beta_schedule, alphabar_schedule)
    # (4) Ancestral sampling: x_{t-1} ~ N(mu, sigma * I).
    sampled_crds = torch.normal(mu, torch.sqrt(sigma * noise_scale))
    delta = sampled_crds - xt[:, 1, :]
    ...
    out_crds = xt + delta[:, None, :]
    return out_crds / crd_scale, delta / crd_scale
```

Each numbered line is a line-for-line transcription of an equation from Section 1.

  • Line (1) is the closed-form DDPM posterior variance $\tilde{\beta}_t = \tfrac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$, with alphabar_schedule and beta_schedule precomputed at startup.
  • Lines (2) and (3) are the two terms of the closed-form posterior mean
    $$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) \;=\; \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 \;+\; \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t,$$
    with $\sqrt{\alpha_t}$ written out as $\sqrt{1-\beta_t}$ and $\mathbf{x}_0$ replaced by the network’s prediction px0 (an $\mathbf{x}_0$-parameterisation rather than $\boldsymbol{\epsilon}$-parameterisation, but as we will see in Section 2 the two are linear transformations of each other in this Gaussian setting).
  • Line (4) is the ancestral sampling step itself: draw $\mathbf{x}_{t-1} \sim \mathcal{N}(\tilde{\boldsymbol{\mu}}_t,\,\tilde{\beta}_t\mathbf{I})$, exactly the per-step update from Section 1. The noise_scale knob is a protein-design-specific multiplier that lets the user trade off sample diversity against fidelity to the training distribution; setting noise_scale = 1.0 recovers the textbook DDPM sampler.

One little note to keep in mind here: RFDiffusion uses this Gaussian DDPM update only on the translational (C-alpha) part of the structure: rotations are diffused on $SO(3)$ with an IGSO(3) reverse process (the get_next_frames helper alongside get_next_ca in the same file), because the Gaussian formalism does not transfer cleanly to a curved manifold.

2. DDIM: Non-Markovian Generalisation and Deterministic Sampling

As mentioned above, the DDPM formulation allows for highly efficient training that can be parallelised and therefore scaled. At sampling time, however, the Markov structure of the process forces us to start from the tractable distribution at $T$ and traverse every step of the learned chain in reverse. This meant that even with a perfectly trained model one needed hundreds to thousands of network evaluations to produce a single sample. Song, Meng & Ermon (2021a) therefore asked the following question: does the DDPM training objective actually require the forward process to be Markov?

The answer is no, and the consequence is a sampler that converges in 20–50 steps instead of 1000.

A first observation: noise-prediction is denoised-image-prediction

Before deriving DDIM, it is worth absorbing the equivalence between two natural parameterisations of the network’s output. The forward marginal is

$$\mathbf{x}_t \;=\; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 \;+\; \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}.$$

This is a linear relationship between three quantities ($\mathbf{x}_0$, $\mathbf{x}_t$, $\boldsymbol{\epsilon}$). Given any two of them we can solve for the third:

$$\mathbf{x}_0 \;=\; \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_t}},\qquad \boldsymbol{\epsilon} \;=\; \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}}.$$

So a network that takes $\mathbf{x}_t$ as input and predicts $\boldsymbol{\epsilon}$ is equivalent to a network that predicts $\mathbf{x}_0$: each prediction is a fixed linear transformation of the other (with coefficients that depend on $t$ but not on the data). In particular, if the network outputs $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ we can immediately extract a denoised-image estimate

$$\hat{\mathbf{x}}_0(\mathbf{x}_t,t) \;=\; \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)}{\sqrt{\bar{\alpha}_t}}.$$

This $\hat{\mathbf{x}}_0$ is essentially the Tweedie-formula one-step denoiser: the network’s best guess of the clean image given the noisy observation. We will use it heavily in what follows.
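The two conversions are worth having as explicit helpers (a sketch on a scalar toy; the same lines apply elementwise to tensors):

```python
import math

def x0_from_eps(x_t, eps, abar_t):
    """Denoised estimate implied by a noise prediction (Tweedie-style)."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps) / math.sqrt(abar_t)

def eps_from_x0(x_t, x0, abar_t):
    """Noise estimate implied by a clean-data prediction."""
    return (x_t - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)
```

Either direction recovers the other exactly; this is the sense in which $\boldsymbol{\epsilon}$-prediction and $\mathbf{x}_0$-prediction networks are interchangeable.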

Two equivalent ways to think about a denoising step. Both parameterisations describe exactly the same operation, but they invite different mental pictures.

  • Predict the clean image, then take a small step toward it. Feed $\mathbf{x}_t$ to the network and ask “what does the clean image $\mathbf{x}_0$ look like?” Recall from the discussion of the true posterior mean above that one step of denoising is essentially an interpolation between the noisy state $\mathbf{x}_t$ and the predicted clean state $\hat{\mathbf{x}}_0$, with the interpolation weight controlled by the schedule. So a step of the reverse chain literally drags the sample a little closer to where the network thinks the data lives.
  • Predict the noise, then subtract a small fraction of it. Equivalently, ask “what noise was added to produce $\mathbf{x}_t$?” and remove a small portion of it, taking a step away from the noise direction. The two views are linked by the linear relation $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$, so the two updates are mathematically identical.

The DDPM paper picked $\boldsymbol{\epsilon}$-prediction because the regression target has unit variance by construction (the network is asked to predict an $\mathcal{N}(\mathbf{0}, \mathbf{I})$ vector at every $t$), which makes the network well-conditioned across the entire schedule. Different downstream methods will switch between parameterisations as convenient.

A family of non-Markovian forward processes

The DDPM loss $\mathcal{L}_{\text{simple}}$ depends only on the marginals $q(\mathbf{x}_t\mid\mathbf{x}_0)$: at each step we sample one $\mathbf{x}_t$ from this marginal and regress. It does not depend on the joint $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ in any way that involves the inter-step structure. Song et al. (2021a) leverage this: they consider an entire family of (possibly non-Markovian) forward processes that share the same per-step marginals as DDPM but differ in their posterior conditionals $q_\sigma(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$. Each member of this family corresponds to a valid sampler that re-uses the same trained noise predictor $\boldsymbol{\epsilon}_\theta$, no retraining needed.

The family is indexed by a sequence of stochasticity parameters $\{\sigma_t\}$ and posits

$$q_\sigma(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) \;=\; \mathcal{N}\!\Bigl(\mathbf{x}_{t-1};\,\sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\tfrac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}},\,\sigma_t^2\mathbf{I}\Bigr).$$

The structure of this conditional is worth absorbing. The mean is a sum of two terms:

  • $\sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0$: the signal component at noise level $t-1$, i.e. where $\mathbf{x}_0$ would project under the forward marginal at time $t-1$;
  • $\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\tfrac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}}$: a scaled version of the noise direction $\boldsymbol{\epsilon}$ implied by the pair $(\mathbf{x}_t, \mathbf{x}_0)$.

Plus a Gaussian fluctuation of variance $\sigma_t^2 \mathbf{I}$. Song et al. (2021a) prove that, for any choice of $\{\sigma_t\}$, this family yields the same marginals $q(\mathbf{x}_t \mid \mathbf{x}_0)$ as DDPM, and therefore the same training objective and the same trained network are valid.

The DDIM sampler

Plugging in the network’s prediction $\hat{\mathbf{x}}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta)/\sqrt{\bar{\alpha}_t}$ in place of the true $\mathbf{x}_0$, we obtain the DDIM update in a very transparent form:

$$\mathbf{x}_{t-1} \;=\; \underbrace{\sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0}_{\text{signal at level $t-1$}} \;+\; \underbrace{\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\boldsymbol{\epsilon}_\theta}_{\text{noise direction, scaled}} \;+\; \underbrace{\sigma_t\,\mathbf{z}}_{\text{fresh randomness}},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).$$

Compare directly with the DDPM ancestral update,

$$\mathbf{x}_{t-1} \;=\; \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta\right) \;+\; \sigma_t^{\text{DDPM}}\,\mathbf{z},$$

and the structural difference becomes clear. The DDPM update is a local perturbation of $\mathbf{x}_t$ whose coefficients $1/\sqrt{\alpha_t}$ and $\beta_t/\sqrt{1-\bar{\alpha}_t}$ are calibrated for a single $\Delta t = 1$ Markov step; the step size is baked into the formula. The DDIM update instead reconstructs $\mathbf{x}_{t-1}$ from scratch by combining the predicted clean image $\hat{\mathbf{x}}_0$ (rescaled to the target noise level via $\sqrt{\bar{\alpha}_{t-1}}$) with the predicted noise direction (rescaled by $\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}$). The target time index $t-1$ enters the right-hand side only through the cumulative noise level $\bar{\alpha}_{t-1}$; the local step quantities $\beta_t$ and $\alpha_t$ do not appear explicitly at all. (The source time $t$ still enters, of course, through $\hat{\mathbf{x}}_0$ and $\boldsymbol{\epsilon}_\theta$, which depend on $\mathbf{x}_t$ and $\bar{\alpha}_t$.)

The stochasticity parameter σt\sigma_t is conventionally written

σt(η)  =  η1αˉt11αˉt(1αˉtαˉt1),\sigma_t(\eta) \;=\; \eta\,\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\Bigl(1-\tfrac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}\Bigr)},

with η[0,1]\eta\in[0,1] interpolating between two limits:

  • η=1\eta = 1 recovers the DDPM ancestral sampler (up to a covariance choice equivalent to β~t\tilde{\beta}_t);
  • η=0\eta = 0 gives the deterministic DDIM sampler, the workhorse of the modern fast-sampling regime:
xt1  =  αˉt1x^0(xt,t)  +  1αˉt1ϵθ(xt,t).\mathbf{x}_{t-1} \;=\; \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0(\mathbf{x}_t,t) \;+\; \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t).
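The step-size-agnostic structure of this update is easy to verify numerically. Below is a minimal NumPy sketch (the function name `ddim_step` and the oracle noise prediction are mine, not the paper's): with an exact ϵ\boldsymbol{\epsilon}, one direct leap between noise levels lands exactly where two smaller leaps would.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_s, eta=0.0):
    """One DDIM update from cumulative signal level abar_t to abar_s
    (abar_s > abar_t means less noise). eta=0 is the deterministic sampler."""
    # Reconstruct the predicted clean sample from (x_t, eps_pred).
    x0_hat = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Song et al.'s sigma_t(eta), with (abar_t, abar_{t-1}) -> (abar_t, abar_s).
    sigma = eta * np.sqrt((1 - abar_s) / (1 - abar_t) * (1 - abar_t / abar_s))
    # Remix signal and noise at the target level, plus optional fresh randomness.
    z = np.random.randn(*np.shape(x_t)) if eta > 0 else 0.0
    return np.sqrt(abar_s) * x0_hat + np.sqrt(1 - abar_s - sigma**2) * eps_pred + sigma * z

# One-leap versus two-leap consistency with an oracle eps:
x0_true, eps_true = 1.0, 0.5
x_t = np.sqrt(0.1) * x0_true + np.sqrt(0.9) * eps_true       # very noisy state
leap = ddim_step(x_t, eps_true, 0.1, 0.7)                    # jump straight to abar = 0.7
via = ddim_step(ddim_step(x_t, eps_true, 0.1, 0.4), eps_true, 0.4, 0.7)
print(np.isclose(leap, via))  # True
```

In practice ϵθ\boldsymbol{\epsilon}_\theta is only approximately correct and changes along the trajectory, which is what ultimately limits how far one can leap.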

Why DDIM allows fewer sampling steps

It is tempting to attribute DDIM’s step-skipping ability simply to “the trajectory is smooth.” That is part of the story but not the whole picture. Three observations, each with clean intuitive content, together explain the effect.

(i) The update is anchored to a noise-level-agnostic prediction. Look again at the DDIM update: the right-hand side combines x^0\hat{\mathbf{x}}_0 (the network’s guess of the clean image) and ϵθ\boldsymbol{\epsilon}_\theta (the noise it sees) with coefficients αˉs\sqrt{\bar{\alpha}_{s}} and 1αˉsσs2\sqrt{1-\bar{\alpha}_{s}-\sigma_s^2} for any target noise level ss — and in the deterministic (η=0\eta = 0) case, this collapses to the clean form αˉs\sqrt{\bar{\alpha}_{s}} and 1αˉs\sqrt{1-\bar{\alpha}_{s}} displayed above. Crucially, x^0\hat{\mathbf{x}}_0 does not depend on which ss we want to jump to, it just depends on the current xt\mathbf{x}_t. So once the network has produced its best guess of the clean image, the formula can reassemble a sample at any lower noise level by simply remixing signal and noise in the proportions appropriate to that level. We can leap directly from t=1000t = 1000 to t=500t = 500 without visiting any intermediate state. (Song et al. formalise this in Section 4.2 / Appendix C.1 of the paper: the same update applies along any sub-sequence τ{1,,T}\tau \subset \{1,\dots,T\}, replacing (αˉt,αˉt1)(\bar{\alpha}_t, \bar{\alpha}_{t-1}) with (αˉτi,αˉτi1)(\bar{\alpha}_{\tau_i}, \bar{\alpha}_{\tau_{i-1}}).) Contrast this with the DDPM update, whose right-hand side contains βt/1αˉt\beta_t/\sqrt{1-\bar{\alpha}_t}. That coefficient encodes the size of a single small Markov step: it tells the sampler how much progress to make in one Δt=1\Delta t = 1 increment. There is no analogous “leap to ss” knob, because the DDPM step size is hard-coded into the formula itself.

(ii) DDIM is a first-order discretisation of an ODE, not an SDE. In the deterministic η=0\eta = 0 case there is no fresh noise injection at each step, so the only source of error is the truncation error of the ODE solver, i.e. the discrepancy between the discrete update and the continuous probability-flow trajectory we will discuss in Section 3. Truncation error scales smoothly with step size and accumulates additively. SDE discretisations are different: each step injects fresh Gaussian noise whose variance scales with the step size, so larger SDE steps mean noisier samples in a way that no amount of network accuracy can fix. ODE solvers do not suffer from this.

(iii) The probability-flow trajectories tend to be relatively low-curvature. Karras et al. (2022) later showed empirically that the deterministic trajectory from prior to data in diffusion models is approximately linear over much of its length, with most of the curvature concentrated near the data manifold. A nearly-straight trajectory is well-approximated by a few large Euler steps.

Together these three points explain the empirical observation that DDIM with 20–50 steps matches DDPM with 1000 steps with no retraining required. DDIM also gives an invertible encoding from noise to data, which is the substrate of latent-space interpolation, editing, and exact likelihood evaluation via the change-of-variables formula.

Intuition: DDIM keeps the same marginals as DDPM but parameterises the reverse process around x^0\hat{\mathbf{x}}_0, the predicted clean image, rather than around xt\mathbf{x}_t. Each step says “given my best guess of the clean image, where would I be at noise level t1t-1?” This is invariant to step size: the formula is happy to jump to any target noise level. Removing the stochasticity (η=0\eta = 0) on top of this turns sampling into a deterministic trajectory you can traverse with bigger strides.

3. Score-Based Generative Modelling Through SDEs

A few months after DDIM, [Song, Sohl-Dickstein, Kingma, Kumar, Ermon & Poole (2021b)](#ref-song2021b) published the paper that, in retrospect, unified everything that came before. Two parallel threads had been developing in the literature:

  • The DDPM thread I described above, descending from variational hierarchical-VAE constructions and trained with the ϵ\epsilon-prediction MSE.
  • The NCSN / SMLD thread initiated by [Song & Ermon (2019)](#ref-songermon2019), “Noise Conditional Score Network” and “Score Matching with Langevin Dynamics”, which trained a network sθ(x,σ)xlogpσ(x)\mathbf{s}_\theta(\mathbf{x}, \sigma) \approx \nabla_\mathbf{x}\log p_\sigma(\mathbf{x}) to estimate the score function of a noise-corrupted version of the data distribution at many noise scales σ1>σ2>>σL\sigma_1 > \sigma_2 > \dots > \sigma_L. To sample, NCSN ran annealed Langevin dynamics: starting from a wide Gaussian, take some Langevin steps at the largest σ\sigma, anneal to the next smaller σ\sigma, repeat.

The training objective for NCSN was denoising score matching (Hyvärinen, 2005; Vincent, 2011), and the sampler was a discrete-time Langevin chain. Song et al. (2021b) showed that both DDPM and NCSN are discretisations of the same continuous-time stochastic differential equation. This unification opened the door to better samplers, exact likelihood evaluation via the probability-flow ODE, and a host of downstream methods including guidance and inverse-problem solvers.

Forward SDE: setup and three regimes

The forward corruption is now an Itô stochastic differential equation on t[0,T]t \in [0, T]:

dx  =  f(x,t)dt  +  g(t)dw,\mathrm{d}\mathbf{x} \;=\; \mathbf{f}(\mathbf{x},t)\,\mathrm{d}t \;+\; g(t)\,\mathrm{d}\mathbf{w},

where w\mathbf{w} is a standard dd-dimensional Wiener process (Brownian motion), f(x,t)\mathbf{f}(\mathbf{x},t) is the drift coefficient, and g(t)g(t) is the diffusion coefficient. Three concrete choices of (f,g)(\mathbf{f}, g) recover existing models:

(a) Variance Preserving (VP-SDE), the continuous-time limit of DDPM:

dx  =  12β(t)xdt  +  β(t)dw.\mathrm{d}\mathbf{x} \;=\; -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t \;+\; \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w}.

This is an inhomogeneous Ornstein–Uhlenbeck (OU) process: a stochastic process whose drift is a linear restoring force pulling x\mathbf{x} toward the origin (with strength 12β(t)\tfrac{1}{2}\beta(t)), and whose diffusion term injects isotropic noise (of strength β(t)\sqrt{\beta(t)}). The “pull toward the origin” can be read directly off the drift coefficient 12β(t)x-\tfrac{1}{2}\beta(t)\,\mathbf{x}: it always points in the direction opposite to the current state x\mathbf{x}, with magnitude proportional to the distance from the origin. So if x\mathbf{x} sits far out along some axis, the deterministic part of the dynamics pushes it back toward zero; if x\mathbf{x} is already near the origin, the drift is small. This is the continuous-time analogue of the discrete forward step xt=1βtxt1+\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \dots from Section 1, where the factor 1βt<1\sqrt{1-\beta_t} < 1 was shrinking the mean toward zero at each step. The asymptotic stationary distribution as tt \to \infty is N(0,I)\mathcal{N}(\mathbf{0}, \mathbf{I}), the Gaussian prior we want.

(b) Variance Exploding (VE-SDE), the continuous-time limit of NCSN / SMLD:

dx  =  dσ2(t)dtdw.\mathrm{d}\mathbf{x} \;=\; \sqrt{\tfrac{\mathrm{d}\,\sigma^2(t)}{\mathrm{d}t}}\,\mathrm{d}\mathbf{w}.

Notice: no drift. The data is never rescaled; only noise is added, at a rate determined by the schedule σ(t)\sigma(t). The marginal is xt=x0+σ(t)ϵ\mathbf{x}_t = \mathbf{x}_0 + \sigma(t)\,\boldsymbol{\epsilon}, and as σ(t)\sigma(t) \to \infty the marginal becomes a very wide Gaussian centred at x0\mathbf{x}_0: the variance literally explodes. The reverse process will therefore look like a huge Gaussian collapsing onto a new generated sample.

(c) sub-VP-SDE, introduced in the same paper:

dx  =  12β(t)xdt  +  β(t)(1e2 ⁣0tβ(s)ds)dw.\mathrm{d}\mathbf{x} \;=\; -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t \;+\; \sqrt{\beta(t)\bigl(1-e^{-2\!\int_0^t\beta(s)\,\mathrm{d}s}\bigr)}\,\mathrm{d}\mathbf{w}.

This is a hybrid with variance strictly less than the VP-SDE at every tt, empirically giving better likelihoods on CIFAR-10. We will not dwell on it but it’s worth knowing it exists.

DDPM is a discretisation of the VP-SDE

The claim that DDPM is the VP-SDE in disguise is worth working out, because the algebra is short and clarifying. Discretise the VP-SDE with the Euler–Maruyama scheme, the simplest first-order numerical scheme for SDEs, using a step Δt\Delta t:

xt+Δtxt    12β(t)xtΔt  +  β(t)Δtϵ,\mathbf{x}_{t+\Delta t} – \mathbf{x}_t \;\approx\; -\tfrac{1}{2}\beta(t)\,\mathbf{x}_t\,\Delta t \;+\; \sqrt{\beta(t)\,\Delta t}\,\boldsymbol{\epsilon},

so

xt+Δt    (112β(t)Δt)xt  +  β(t)Δtϵ.\mathbf{x}_{t+\Delta t} \;\approx\; \bigl(1 – \tfrac{1}{2}\beta(t)\,\Delta t\bigr)\,\mathbf{x}_t \;+\; \sqrt{\beta(t)\,\Delta t}\,\boldsymbol{\epsilon}.

Now identify the discrete DDPM coefficients with the continuous SDE coefficients via βtdisc=β(t)Δt\beta_t^{\text{disc}} = \beta(t)\,\Delta t (i.e. the discrete βt\beta_t in DDPM is β(t)\beta(t) times the time-step). For small βt\beta_t we have the Taylor expansion 1βt112βt\sqrt{1 – \beta_t} \approx 1 – \tfrac{1}{2}\beta_t, so

(112βtdisc)xt    1βtdiscxt.\bigl(1 – \tfrac{1}{2}\beta_t^{\text{disc}}\bigr)\,\mathbf{x}_t \;\approx\; \sqrt{1 – \beta_t^{\text{disc}}}\,\mathbf{x}_t.

The Euler–Maruyama discretisation thus reads

xt+Δt    1βtdiscxt  +  βtdiscϵ,\mathbf{x}_{t+\Delta t} \;\approx\; \sqrt{1 – \beta_t^{\text{disc}}}\,\mathbf{x}_t \;+\; \sqrt{\beta_t^{\text{disc}}}\,\boldsymbol{\epsilon},

which is exactly the DDPM forward step xt=αtxt1+βtϵ\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon} from Section 1. So DDPM is literally the Euler–Maruyama discretisation of the VP-SDE with βtdisc=β(t)Δt\beta_t^{\text{disc}} = \beta(t)\,\Delta t.
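The identification can be checked numerically in two lines (a sketch, not from the paper): the only difference between one Euler–Maruyama step of the VP-SDE and one exact DDPM forward step is the mean coefficient, and the gap closes as O(βt2)O(\beta_t^2).

```python
import numpy as np

# Euler–Maruyama mean coefficient (1 - beta/2) vs exact DDPM coefficient
# sqrt(1 - beta), under the identification beta_disc = beta(t) * dt.
# The noise terms already match exactly: both scale as sqrt(beta_disc).
for beta_disc in (1e-1, 1e-2, 1e-3):
    em_coeff = 1.0 - 0.5 * beta_disc        # Euler–Maruyama mean coefficient
    ddpm_coeff = np.sqrt(1.0 - beta_disc)   # exact DDPM mean coefficient
    print(f"beta={beta_disc:.0e}  gap={em_coeff - ddpm_coeff:.2e}")  # ~ beta^2 / 8
```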

Why is VP-SDE “variance preserving”?

Take the scalar case for clarity and compute the conditional variance recursively. From the forward step xt=1βtxt1+βtϵ\mathbf{x}_t = \sqrt{1-\beta_t}\,\mathbf{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon} with ϵN(0,1)\boldsymbol{\epsilon}\sim\mathcal{N}(0,1) independent of xt1\mathbf{x}_{t-1}:

Var(xt)  =  (1βt)Var(xt1)  +  βt.\mathrm{Var}(\mathbf{x}_t) \;=\; (1-\beta_t)\,\mathrm{Var}(\mathbf{x}_{t-1}) \;+\; \beta_t.

If Var(xt1)=1\mathrm{Var}(\mathbf{x}_{t-1}) = 1, then Var(xt)=(1βt)1+βt=1\mathrm{Var}(\mathbf{x}_t) = (1-\beta_t)\cdot 1 + \beta_t = 1. So once the variance reaches one, it stays exactly one, independently of the schedule {βt}\{\beta_t\}. This is the precise sense in which the VP-SDE preserves variance: the marginal variance is a fixed point of the forward dynamics at value 1. The forward process keeps every xt\mathbf{x}_t in roughly the same magnitude range as the data, which is convenient numerically.

How VE-SDE is the continuous limit of NCSN/SMLD

Now to the variance-exploding side. NCSN/SMLD as originally constructed (Song & Ermon, 2019) trains a network at LL discrete noise scales σ1>σ2>>σL\sigma_1 > \sigma_2 > \dots > \sigma_L: at scale σi\sigma_i the “data” is x+σiϵ\mathbf{x} + \sigma_i\boldsymbol{\epsilon}, with x\mathbf{x} drawn from the dataset. The network outputs an estimate of the score xlogpσi(x)\nabla_\mathbf{x}\log p_{\sigma_i}(\mathbf{x}) at each scale, where pσp_\sigma denotes the data distribution convolved with N(0,σ2I)\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}).

To pass to the continuous-time version, let the discrete scales become a continuous schedule σ(t)\sigma(t) and ask: which SDE produces marginals xt=x0+σ(t)ϵ\mathbf{x}_t = \mathbf{x}_0 + \sigma(t)\,\boldsymbol{\epsilon} where ϵN(0,I)\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})?

Since the marginal mean is x0\mathbf{x}_0 (constant in tt), the drift must be zero: f(x,t)=0\mathbf{f}(\mathbf{x},t) = 0. The marginal variance is σ2(t)\sigma^2(t), and for an SDE with zero drift and diffusion g(t)g(t), the variance evolves as

ddtVar(xt)  =  g(t)2.\frac{\mathrm{d}}{\mathrm{d}t}\mathrm{Var}(\mathbf{x}_t) \;=\; g(t)^2.

Setting Var(xt)=σ2(t)\mathrm{Var}(\mathbf{x}_t) = \sigma^2(t) gives g(t)2=dσ2(t)dtg(t)^2 = \tfrac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}, hence

g(t)  =  dσ2(t)dt.g(t) \;=\; \sqrt{\frac{\mathrm{d}\,\sigma^2(t)}{\mathrm{d}t}}.

This is exactly the VE-SDE coefficient. The corresponding SDE is

dx  =  dσ2(t)dtdw,\mathrm{d}\mathbf{x} \;=\; \sqrt{\tfrac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}}\,\mathrm{d}\mathbf{w},

with marginal xt=x0+σ(t)ϵ\mathbf{x}_t = \mathbf{x}_0 + \sigma(t)\,\boldsymbol{\epsilon}, exactly the family of noise-corrupted distributions that NCSN trains on. So VE-SDE is the continuous schedule limit of NCSN’s discrete noise-scale construction: just as the VP-SDE is the continuous-time limit of DDPM, the VE-SDE is the continuous-time limit of NCSN.

The term “variance exploding” refers to the divergence of σ2(t)\sigma^2(t) at large tt: in standard NCSN setups σ\sigma is geometric with σmax\sigma_{\max} in the dozens or hundreds, so the marginal xt\mathbf{x}_t at late time has variance much greater than 1. In contrast, VP keeps variance bounded at 1.

Numerical example (VP vs VE at matched SNR)

The cleanest way to see the geometric difference between VP and VE is to match them at the same signal-to-noise ratio (SNR). With σdata=1\sigma_{\text{data}} = 1, the two SNRs are:

SNRVP(t)  =  αˉt1αˉt,SNRVE(σ)  =  1σ2.\mathrm{SNR}_{\text{VP}}(t) \;=\; \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t},\qquad \mathrm{SNR}_{\text{VE}}(\sigma) \;=\; \frac{1}{\sigma^2}.

Setting these equal gives the conversion σVE=(1αˉ)/αˉ\sigma_{\text{VE}} = \sqrt{(1-\bar{\alpha})/\bar{\alpha}}. Picking three noise levels and applying the same x0=1.0\mathbf{x}_0 = 1.0, ϵ=0.5\boldsymbol{\epsilon} = 0.5:

| Noise level (SNR) | αˉ\bar{\alpha} | σVE\sigma_{\text{VE}} | xt\mathbf{x}_t under VP | xt\mathbf{x}_t under VE | Var(xtx0)\mathrm{Var}(\mathbf{x}_t\mid \mathbf{x}_0) VP | Var(xtx0)\mathrm{Var}(\mathbf{x}_t\mid \mathbf{x}_0) VE |
| --- | --- | --- | --- | --- | --- | --- |
| Low (SNR = 9) | 0.9 | 0.333 | 0.91+0.10.5=1.107\sqrt{0.9}\cdot 1 + \sqrt{0.1}\cdot 0.5 = 1.107 | 1+0.3330.5=1.1671 + 0.333\cdot 0.5 = 1.167 | 0.10 | 0.111 |
| Mid (SNR = 1) | 0.5 | 1.000 | 0.51+0.50.5=1.061\sqrt{0.5}\cdot 1 + \sqrt{0.5}\cdot 0.5 = 1.061 | 1+1.0000.5=1.5001 + 1.000\cdot 0.5 = 1.500 | 0.50 | 1.000 |
| High (SNR = 1/9) | 0.1 | 3.000 | 0.11+0.90.5=0.791\sqrt{0.1}\cdot 1 + \sqrt{0.9}\cdot 0.5 = 0.791 | 1+3.0000.5=2.5001 + 3.000\cdot 0.5 = 2.500 | 0.90 | 9.000 |

Reading the table left-to-right: at every row, VP and VE encode the same amount of information about x0\mathbf{x}_0 (same SNR), but they live on completely different scales. VP keeps the magnitude of xt\mathbf{x}_t bounded near 1 across all noise levels. VE lets the magnitude grow without bound: at high noise the data is a vanishing perturbation of a wide Gaussian. Same ϵ\boldsymbol{\epsilon}, same information content, drastically different geometry.
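The rows of the table can be reproduced in a few lines of NumPy (variable names are mine):

```python
import numpy as np

# VP vs VE at matched SNR, with sigma_data = 1: same clean sample x0 and the
# same noise draw eps at every level; only the geometry of the path differs.
x0, eps = 1.0, 0.5
for abar in (0.9, 0.5, 0.1):
    snr = abar / (1.0 - abar)                    # SNR_VP = abar / (1 - abar)
    sigma_ve = np.sqrt((1.0 - abar) / abar)      # matched level: SNR_VE = 1/sigma^2
    x_vp = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    x_ve = x0 + sigma_ve * eps
    print(f"SNR={snr:5.3f}  x_vp={x_vp:.3f}  x_ve={x_ve:.3f}  "
          f"var_vp={1.0 - abar:.2f}  var_ve={sigma_ve**2:.3f}")
```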

Reverse-time SDE and the score function

The fundamental theoretical result of Song et al. (2021b) is the reversibility of the forward SDE. Due to Anderson (1982), every Itô diffusion admits a reverse-time companion whose drift is the original drift minus the score of the time-marginal:

dx  =  [f(x,t)g(t)2 ⁣xlogpt(x)]dt  +  g(t)dwˉ,\mathrm{d}\mathbf{x} \;=\; \bigl[\,\mathbf{f}(\mathbf{x},t) \,-\, g(t)^2\,\nabla_{\!\mathbf{x}}\log p_t(\mathbf{x})\,\bigr]\,\mathrm{d}t \;+\; g(t)\,\mathrm{d}\bar{\mathbf{w}},

where wˉ\bar{\mathbf{w}} is a reverse-time Brownian motion. If we can estimate the score xlogpt(x)\nabla_\mathbf{x}\log p_t(\mathbf{x}) at every tt (call this estimate sθ(x,t)\mathbf{s}_\theta(\mathbf{x},t)), we can simulate the reverse-time SDE numerically (Euler–Maruyama again) and produce a sample from p0pdatap_0 \approx p_{\text{data}} starting from xTpT\mathbf{x}_T\sim p_T.

The score is learned by denoising score matching (Vincent, 2011). The loss is

LDSM(θ)  =  Et,x0,ϵ[λ(t)sθ(xt,t) ⁣xtlogpt(xtx0)2].\mathcal{L}_{\text{DSM}}(\theta) \;=\; \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\Bigl[\lambda(t)\,\bigl\lVert\mathbf{s}_\theta(\mathbf{x}_t,t) – \nabla_{\!\mathbf{x}_t}\log p_t(\mathbf{x}_t\mid\mathbf{x}_0)\bigr\rVert^2\Bigr].

Because pt(xtx0)p_t(\mathbf{x}_t\mid\mathbf{x}_0) is a Gaussian with closed form, the conditional score is analytic. For VP at time tt, pt(xtx0)=N(αˉtx0,(1αˉt)I)p_t(\mathbf{x}_t\mid\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I}), so

 ⁣xtlogpt(xtx0)  =  xtαˉtx01αˉt  =  ϵ1αˉt.\nabla_{\!\mathbf{x}_t}\log p_t(\mathbf{x}_t\mid\mathbf{x}_0) \;=\; -\,\frac{\mathbf{x}_t – \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0}{1-\bar{\alpha}_t} \;=\; -\,\frac{\boldsymbol{\epsilon}}{\sqrt{1-\bar{\alpha}_t}}.

This is the exact mathematical bridge to DDPM: predicting the noise ϵ\boldsymbol{\epsilon} and predicting the score  ⁣xtlogpt\nabla_{\!\mathbf{x}_t}\log p_t are the same regression problem up to a fixed scalar 1/1αˉt-1/\sqrt{1-\bar{\alpha}_t}. The DDPM simple loss and the NCSN denoising score matching loss differ only in the constant in front of the MSE, which is absorbed into the weighting function λ(t)\lambda(t).
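The bridge is a one-liner to check numerically (a sketch; names are mine):

```python
import numpy as np

# For the VP kernel p_t(x_t | x0) = N(sqrt(abar_t) x0, (1 - abar_t) I), the
# conditional score equals -eps / sqrt(1 - abar_t): score prediction and
# eps prediction are the same regression up to a fixed scalar.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
abar_t = 0.3

x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
score = -(x_t - np.sqrt(abar_t) * x0) / (1.0 - abar_t)   # analytic Gaussian score
print(np.allclose(score, -eps / np.sqrt(1.0 - abar_t)))  # True
```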

Probability-flow ODE: same marginals, no noise

A deep observation from Song et al. (2021b): the family of stochastic processes that share the same marginals ptp_t as the reverse SDE is not unique. There is a privileged deterministic member, the probability-flow ODE:

dxdt  =  f(x,t)    12g(t)2 ⁣xlogpt(x).\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} \;=\; \mathbf{f}(\mathbf{x},t) \;-\; \tfrac{1}{2}\,g(t)^2\,\nabla_{\!\mathbf{x}}\log p_t(\mathbf{x}).

Solving this ODE backward in time from xTpT\mathbf{x}_T\sim p_T to x0\mathbf{x}_0 produces a sample with the same marginal distribution at every tt as the reverse SDE, but along a deterministic, invertible trajectory. The probability-flow ODE:

  • underlies DDIM, which is (up to a notational change) its first-order Euler discretisation in the VP regime;
  • enables exact log-likelihoods through the instantaneous change-of-variables formula ddtlogpt(xt)=(dx/dt)\tfrac{\mathrm{d}}{\mathrm{d}t}\log p_t(\mathbf{x}_t) = -\nabla\cdot\bigl(\mathrm{d}\mathbf{x}/\mathrm{d}t\bigr);
  • is the substrate on which EDM will later build.

Intuition (score as force). The score xlogpt(x)\nabla_\mathbf{x}\log p_t(\mathbf{x}) is the gradient of the log-density: it points uphill in ptp_t, toward higher-density regions, i.e. toward the data manifold. In the reverse SDE, the term g(t)2logpt-g(t)^2\,\nabla\log p_t in the drift is therefore an attractive force pulling the sample toward the data manifold. The Brownian term g(t)dwˉg(t)\,\mathrm{d}\bar{\mathbf{w}} is exploratory: it injects random fluctuations that prevent the sample from collapsing onto any particular trajectory. The balance between these two, attraction toward data versus stochastic exploration, is exactly the exploration–exploitation tradeoff familiar from Langevin MCMC and from many sampling algorithms in statistical physics. The probability-flow ODE keeps the attraction and drops the exploration entirely, giving a deterministic trajectory that is the “expected” reverse path.

Predictor–corrector sampling and Langevin correctors

A practical contribution of Song et al. (2021b) that I find under-appreciated is predictor–corrector (PC) sampling. The motivation is that any numerical SDE solver makes errors at each step: the discretised sample at time tΔtt-\Delta t has a distribution that drifts from the true marginal ptΔtp_{t-\Delta t}. Over many steps these errors compound. The idea of PC is to correct the sample’s distribution back onto the true ptΔtp_{t-\Delta t} after each predictor step by running a small number of MCMC steps targeting ptΔtp_{t-\Delta t}, using the same learnt score we already have.

The corrector of choice is Langevin MCMC (a.k.a. unadjusted Langevin algorithm or ULA). To sample from a density pp, Langevin dynamics is the SDE

dx  =  12 ⁣logp(x)dt  +  dw.\mathrm{d}\mathbf{x} \;=\; \tfrac{1}{2}\nabla\!\log p(\mathbf{x})\,\mathrm{d}t \;+\; \mathrm{d}\mathbf{w}.

Its stationary distribution is pp. Discretising with the Euler–Maruyama scheme and a step size δ>0\delta > 0,

x    x+δ2 ⁣logp(x)  +  δz,zN(0,I).\mathbf{x} \;\leftarrow\; \mathbf{x} + \tfrac{\delta}{2}\,\nabla\!\log p(\mathbf{x}) \;+\; \sqrt{\delta}\,\mathbf{z},\qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).

Here δ\delta is the Langevin step size, a hyperparameter that controls how aggressively the corrector moves toward higher-density regions per Langevin step. (It is not the SDE timestep Δt\Delta t; it is the discretisation parameter of a different SDE that runs at fixed outer time.) A typical PC sampling iteration looks like:

  1. Predictor: One reverse-SDE step from x\mathbf{x} at time tt to xpred\mathbf{x}^{\text{pred}} at time tΔtt – \Delta t (e.g. Euler–Maruyama with the learnt score).
  2. Corrector: KK Langevin steps at fixed time tΔtt – \Delta t targeting ptΔtp_{t-\Delta t}:
    xx+δ2sθ(x,tΔt)+δz.\mathbf{x} \leftarrow \mathbf{x} + \tfrac{\delta}{2}\,\mathbf{s}_\theta(\mathbf{x},\,t-\Delta t) + \sqrt{\delta}\,\mathbf{z}.

The number of Langevin steps KK and the step size δ\delta are tuned per problem; Song et al. (2021b) typically use K=1K = 1 and a small δ\delta, choosing it so that the “signal-to-noise ratio” of the Langevin update matches a target value. PC samplers gave a noticeable quality boost over plain Euler–Maruyama at the same total number of function evaluations, and the principle (interleave deterministic transport with score-driven stochastic correction) reappears in EDM’s stochastic sampler.
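Here is a toy PC sampler, a sketch under strong simplifying assumptions: the data distribution is a 1-D standard Gaussian and the score is used in closed form rather than learnt, so the only errors are discretisation errors. The step-size choices are mine and untuned.

```python
import numpy as np

# Data ~ N(0,1) under a VE path with sigma(t) = t, so p_t = N(0, 1 + t^2)
# and the exact score is score(x, t) = -x / (1 + t^2).
rng = np.random.default_rng(1)

def score(x, t):
    return -x / (1.0 + t * t)

T, n_steps, K = 10.0, 200, 1
ts = np.linspace(T, 1e-3, n_steps + 1)
x = np.sqrt(1.0 + T * T) * rng.standard_normal(100_000)   # x_T ~ p_T

for t, t_next in zip(ts[:-1], ts[1:]):
    dt = t - t_next                                  # size of the backward step
    g2 = 2.0 * t                                     # g(t)^2 = d(sigma^2)/dt
    # Predictor: one reverse-time Euler–Maruyama step of the reverse SDE.
    x = x + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * rng.standard_normal(x.shape)
    # Corrector: K Langevin steps targeting the marginal p_{t_next}.
    delta = 0.1 * (1.0 + t_next * t_next)            # crude per-level step size
    for _ in range(K):
        x = x + 0.5 * delta * score(x, t_next) + np.sqrt(delta) * rng.standard_normal(x.shape)

print(x.var())  # close to 1: the chain lands on the data distribution N(0,1)
```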

Why this looks familiar to anyone who has run a molecular dynamics simulation. Langevin dynamics is exactly the equation of motion used in MD with a stochastic thermostat: under the identification logp(x)U(x)/kBT\log p(\mathbf{x}) \leftrightarrow -U(\mathbf{x})/k_B T (with UU a potential energy and TT a temperature), the term  ⁣logp(x)\nabla\!\log p(\mathbf{x}) becomes the negative gradient of a potential, i.e. a force. So the Langevin step

x    x+δ2 ⁣logp(x)+δz\mathbf{x} \;\leftarrow\; \mathbf{x} + \tfrac{\delta}{2}\,\nabla\!\log p(\mathbf{x}) + \sqrt{\delta}\,\mathbf{z}

reads, in MD language, as “take a small step in the direction of the force, and add a thermal kick.” Just as MD with a Langevin thermostat samples the Boltzmann distribution eU/kBTe^{-U/k_B T} at long times, our corrector samples the target distribution pp at long times. The score sθ(x,t)\mathbf{s}_\theta(\mathbf{x},t) produced by the neural network is, by analogy, a learnt force field: a position-dependent vector field whose flow lines lead the sampler toward high-density regions of ptp_t. The thermal noise δz\sqrt{\delta}\,\mathbf{z} prevents the sample from collapsing into a single energy minimum (a single mode of pp), exactly as it does in MD at finite temperature. This is a useful mental picture when you want to reason about ergodicity, mode mixing, or step-size choices in the sampler: most of the intuition built up in computational statistical mechanics transfers directly.

4. EDM: Elucidating (and Simplifying) the Design Space

By 2022 the diffusion literature had accumulated a small zoo of choices (VP vs VE, ϵ\epsilon– vs score- vs x0\mathbf{x}_0-prediction, linear vs cosine schedule, ancestral vs DDIM vs PC) and it was not clear which combinations mattered for sample quality and which were essentially cosmetic. Karras, Aittala, Aila & Laine (2022), in “Elucidating the Design Space of Diffusion-Based Generative Models” (EDM), attacked this head-on by re-deriving everything in a single, opinionated parameterisation and then ablating each design choice independently. The result is what has become the default in modern image-diffusion and protein-diffusion codebases.

Source. The exposition of the EDM sampler in this section follows closely the excellent talk by the EDM authors on the design space of diffusion samplers. I highly recommend watching it in full: it is the clearest single source I know of for the geometric intuition behind the σ\sigma-parameterisation, preconditioning, and Heun-plus-churn sampling.

The “x-ray view”: decoupling the design choices

The framing the EDM authors themselves use is worth absorbing. Before EDM, methods like VP, VE, iDDPM, and DDIM looked like tightly coupled packages you had to take as a whole: you couldn’t, say, combine the iDDPM noise schedule with the DDIM sampler without something breaking, because each package’s choices were implicitly tangled. EDM gives an x-ray view into the internals: it shows that any such method decomposes into a small set of orthogonal design choices,

  • the ODE / SDE solver (Euler, Heun, RK45, …),
  • the time-step discretisation (where to place the NN sampling points along σ\sigma),
  • the noise schedule σ(t)\sigma(t) (how σ\sigma grows with the abstract time variable),
  • the signal scaling s(t)s(t) (whether and how xt\mathbf{x}_t is rescaled),
  • the network preconditioning (cskip,cout,cin,cnoisec_{\text{skip}}, c_{\text{out}}, c_{\text{in}}, c_{\text{noise}}),
  • the training noise distribution p(σ)p(\sigma),
  • the loss weighting λ(σ)\lambda(\sigma),

each of which can be analysed and tuned in isolation, without invalidating the other choices. Armed with this decomposition, Karras et al. systematically searched for the best individual setting of each axis and stacked the improvements.

Sampling and training can be mixed and matched freely. An immediately useful practical consequence of the decoupling above: the training-time choices (preconditioning, loss weighting, noise distribution) and the sampling-time choices (solver, time-step schedule, churn) live on independent axes. You can take a network trained with the iDDPM noise distribution and weighting, and run it with the EDM Heun sampler at inference, no retraining required. Indeed, much of the empirical force of the EDM paper comes from precisely this experiment: Karras et al. apply their improved sampler to off-the-shelf pre-trained VP, VE, and iDDPM models and obtain large quality gains without touching the weights. The corollary is that sampler design and training design can be iterated on independently, a huge practical win, because re-training a diffusion model is expensive while changing the sampler is essentially free.

Before going through their choices it helps to name the two distinct error sources in sampling, which we want to analyse in isolation:

  1. Network (denoiser) error: even if numerical integration were exact, the denoiser DθD_\theta is only an approximation of the optimal denoiser, so it pushes the trajectory in a slightly wrong direction at each step.
  2. Truncation error: the deterministic trajectory we want to integrate is a curve in (x,σ)(\mathbf{x},\sigma)-space, and any solver discretises it with straight segments. Larger steps mean larger curvature-induced deviations from the true curve.

These two sources interact (worse network → larger drift → harder for the solver), but the sampler design controls primarily the second, while the training design controls primarily the first. EDM separates them cleanly.

The σ\sigma-parameterisation

EDM throws out the discrete time index tt and works directly with the noise level σ\sigma. The forward path is the simplest possible:

x(σ)  =  x0  +  σϵ,ϵN(0,I),\mathbf{x}(\sigma) \;=\; \mathbf{x}_0 \;+\; \sigma\,\boldsymbol{\epsilon},\qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),

i.e. a variance-exploding path with σ\sigma playing both the role of “time” and “noise scale.” Karras et al. pick σ(t)=t\sigma(t) = t so that dσ=dt\mathrm{d}\sigma = \mathrm{d}t, and apply no signal scaling (s(t)=1s(t) = 1). Their justification, which I think is the single deepest insight of the paper, is given in the next subsection.

Two ways to deal with the signal-magnitude problem

A practical concern with the VE path is that for large σ\sigma, the magnitude of x=x0+σϵ\mathbf{x} = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon} becomes huge, values in the tens or hundreds, well outside the comfortable unit-scale range that neural networks like to operate in. Naively feeding x\mathbf{x} at σ=80\sigma = 80 to a CNN is asking for unstable training.

There are exactly two principled cures for this, and they correspond to the two camps in the pre-EDM literature:

Cure 1 (VP-style): rescale the signal inside the forward path. Introduce an explicit signal scaling s(t)s(t) so that the actual sample fed to the network is s(t)x(t)s(t)\mathbf{x}(t), with s(t)s(t) chosen to keep the magnitude of s(t)x(t)s(t)\mathbf{x}(t) bounded. This is what VP does implicitly: setting xt=αˉtx0+1αˉtϵ\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} is exactly a rescaled version of the VE path, with s(t)=αˉts(t) = \sqrt{\bar{\alpha}_t} squeezing the signal down to keep total variance at 1.

Cure 2 (VE + network preconditioning): leave the path alone, rescale at the network’s input and output. Keep the simple VE path x(σ)=x0+σϵ\mathbf{x}(\sigma) = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon} in its raw form, but wrap the neural network so that its input is divided by σ2+σdata2\sqrt{\sigma^2 + \sigma_{\text{data}}^2} (giving unit-variance input) and its output is multiplied by an appropriate σ\sigma-dependent factor. From the network’s point of view it always sees well-scaled tensors, even though the underlying ODE state x\mathbf{x} may have huge magnitude.

Both cures solve the same numerical problem and let the network see well-scaled tensors throughout training and sampling. But they have very different geometric consequences for the underlying ODE trajectory. The VP rescaling distorts the natural flow lines of the ODE: it bends what would be a near-linear trajectory in x\mathbf{x}-space into a curved one. The VE-plus-preconditioning approach leaves the ODE alone (the flow lines retain whatever natural geometry they have) and absorbs the multi-scale problem inside the network. EDM argues, and demonstrates empirically, that the second cure is strictly better: straighter flow lines mean fewer Euler steps are needed for the same quality. This is why EDM commits to s(t)=1s(t) = 1 (no signal scaling in the ODE) and pushes all the magnitude management into preconditioning.
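A quick numerical illustration of Cure 2 (a sketch; only the input side is exercised here, using EDM's cin=1/σ2+σdata2c_{\text{in}} = 1/\sqrt{\sigma^2+\sigma_{\text{data}}^2}):

```python
import numpy as np

# The raw VE state x = x0 + sigma*eps has variance sigma_data^2 + sigma^2,
# huge at sigma = 80, but the preconditioned input c_in(sigma) * x hands the
# network a unit-variance tensor at every noise level.
rng = np.random.default_rng(0)
sigma_data = 1.0
x0 = sigma_data * rng.standard_normal(1_000_000)

for sigma in (0.1, 1.0, 80.0):
    x = x0 + sigma * rng.standard_normal(x0.shape)      # raw VE state
    c_in = 1.0 / np.sqrt(sigma**2 + sigma_data**2)      # EDM input preconditioner
    print(f"sigma={sigma:5.1f}  var(x)={x.var():8.1f}  var(c_in*x)={(c_in * x).var():.3f}")
```

The ODE state itself is never rescaled; the multi-scale problem lives entirely inside the network wrapper.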

Deriving the EDM ODE from the forward path

The probability-flow ODE for a general SDE was given in Section 3. For the VE-SDE with f=0\mathbf{f} = 0 and g(t)=dσ2(t)/dtg(t) = \sqrt{\mathrm{d}\sigma^2(t)/\mathrm{d}t} it becomes

dxdt  =  12dσ2(t)dt ⁣xlogpt(x).\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} \;=\; -\,\tfrac{1}{2}\,\frac{\mathrm{d}\sigma^2(t)}{\mathrm{d}t}\,\nabla_{\!\mathbf{x}}\log p_t(\mathbf{x}).

Now invoke Tweedie’s formula: for x=x0+σϵ\mathbf{x} = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon} with x0pdata\mathbf{x}_0\sim p_{\text{data}},

E[x0x]  =  x  +  σ2 ⁣xlogpσ(x),\mathbb{E}\bigl[\mathbf{x}_0\,\big|\,\mathbf{x}\bigr] \;=\; \mathbf{x} \;+\; \sigma^2\,\nabla_{\!\mathbf{x}}\log p_\sigma(\mathbf{x}),

equivalently  ⁣xlogpσ(x)=(D(x;σ)x)/σ2\nabla_{\!\mathbf{x}}\log p_\sigma(\mathbf{x}) = (D(\mathbf{x};\sigma) – \mathbf{x})/\sigma^2, where D(x;σ):=E[x0x]D(\mathbf{x};\sigma) := \mathbb{E}[\mathbf{x}_0\mid\mathbf{x}] is the optimal MMSE denoiser. MMSE stands for minimum mean squared error: among all measurable functions f(x)f(\mathbf{x}) predicting x0\mathbf{x}_0 from a noisy observation x\mathbf{x}, the conditional expectation E[x0x]\mathbb{E}[\mathbf{x}_0\mid\mathbf{x}] is the unique minimiser of the expected squared error Ef(x)x02\mathbb{E}\lVert f(\mathbf{x}) – \mathbf{x}_0\rVert^2. So D(x;σ)D(\mathbf{x};\sigma) is precisely the object that a network trained with an MSE loss would converge to in the infinite-data limit: predict the clean image given the noisy one. Tweedie’s formula tells us this denoiser and the score are two views of the same underlying quantity.
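A sanity check of Tweedie's formula in the one case where everything is analytic, Gaussian data (a sketch; names are mine):

```python
import numpy as np

# For x0 ~ N(0, s^2): p_sigma = N(0, s^2 + sigma^2), so the score at x is
# -x/(s^2 + sigma^2) and Tweedie gives D(x; sigma) = x * s^2/(s^2 + sigma^2),
# the classic Gaussian MMSE shrinkage estimator. A Monte-Carlo estimate of
# E[x0 | x] agrees.
rng = np.random.default_rng(0)
s, sigma, x_query = 2.0, 1.5, 0.7

score = -x_query / (s**2 + sigma**2)
d_tweedie = x_query + sigma**2 * score                 # Tweedie denoiser
d_exact = x_query * s**2 / (s**2 + sigma**2)           # posterior mean via conjugacy

x0s = s * rng.standard_normal(2_000_000)
xs = x0s + sigma * rng.standard_normal(x0s.shape)
near = np.abs(xs - x_query) < 0.05                     # condition on x ≈ x_query
d_mc = x0s[near].mean()                                # Monte-Carlo E[x0 | x]
print(d_tweedie, d_exact, d_mc)                        # all ≈ 0.448
```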

Substituting the Tweedie expression and using σ(t)=t\sigma(t) = t so that dσ2/dt=2t\mathrm{d}\sigma^2/\mathrm{d}t = 2t:

dxdt  =  122tD(x;t)xt2  =  xD(x;t)t.\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} \;=\; -\tfrac{1}{2}\cdot 2t\cdot\frac{D(\mathbf{x};t) - \mathbf{x}}{t^2} \;=\; \frac{\mathbf{x} - D(\mathbf{x};t)}{t}.

This is the EDM ODE, the most economical form one can write for the deterministic generative dynamics. The sign convention deserves a careful word, because at first glance the formula looks like it points the wrong way.

First, a clarification that is worth being explicit about. The forward noising introduced at the start of this section, x(σ)=x0+σϵ\mathbf{x}(\sigma) = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon}, is the marginal of the stochastic VE-SDE dx=g(t)dw\mathrm{d}\mathbf{x} = g(t)\,\mathrm{d}\mathbf{w}. The equation dx/dt=(xD)/t\mathrm{d}\mathbf{x}/\mathrm{d}t = (\mathbf{x}-D)/t we just derived is a different equation: it is the probability-flow ODE, a deterministic dynamical system whose only relation to the noising SDE is that it shares the same marginal distributions pσp_\sigma at each noise level. They are not the same trajectory and not the same equation; they only agree on the one-time marginals.

The defining property of the PF-ODE, stated explicitly by Karras et al. (2022) (Section 2), is that integrating it from tat_a to tbt_b, either forward or backward in tt, produces a sample distributed according to the marginal pσ(tb)p_{\sigma(t_b)}. Because the PF-ODE is deterministic, its solutions can be traversed in either direction of tt, and Karras et al. give the intuition in two complementary sentences: “an infinitesimal forward step of this ODE nudges the sample away from the data… equivalently, a backward step nudges the sample towards the data distribution.” So the same PF-ODE can be used either to corrupt (integrate with Δt>0\Delta t > 0, toward larger noise) or to generate (integrate with Δt<0\Delta t < 0, toward smaller noise).

Now interpret the formula in each direction. Read with increasing tt (increasing noise level σ=t\sigma = t), the velocity vector (xD)/t(\mathbf{x}-D)/t points in the direction xD\mathbf{x} - D, i.e. away from the denoiser’s estimate of the clean image — consistent with the corruption story (sample drifts away from the data manifold as σ\sigma grows). To generate samples we integrate with Δt<0\Delta t < 0, so a single Euler step is xx+Δt(xD)/t=xΔt(xD)/t\mathbf{x} \leftarrow \mathbf{x} + \Delta t \cdot (\mathbf{x}-D)/t = \mathbf{x} - |\Delta t|\cdot(\mathbf{x}-D)/t, which is a step in the direction DxD - \mathbf{x}, toward the current best guess of the clean image. The sampler does the intuitive thing.

The apparent singularity at t0t \to 0 also deserves a closer look. The numerator xD(x;t)\mathbf{x} - D(\mathbf{x};t) and the denominator tt both vanish: as t0t \to 0 the optimal denoiser becomes the identity (D(x;0)=xD(\mathbf{x};0) = \mathbf{x}), so xD(x;t)0\mathbf{x} - D(\mathbf{x};t) \to 0. The two zeros race each other and the resulting limit is finite. To see this concretely, expand DD to first order in tt: for small tt, D(x;t)x+t2 ⁣logpt(x)D(\mathbf{x};t) \approx \mathbf{x} + t^2\,\nabla\!\log p_t(\mathbf{x}) (this is just Tweedie’s formula again, rearranged), so

xD(x;t)t    t ⁣logpt(x)  t0  0.\frac{\mathbf{x} - D(\mathbf{x};t)}{t} \;\approx\; -\,t\,\nabla\!\log p_t(\mathbf{x}) \;\xrightarrow[t \to 0]{}\; 0.

The velocity therefore vanishes as t0t \to 0, and the trajectory smoothly settles onto the data manifold rather than blowing up.

The tangent points directly to the denoiser output: why this matters. Suppose at time tt we extrapolate a single linear Euler step from x(t)\mathbf{x}(t) all the way down to t=0t = 0. Plugging into dx/dt=(xD)/t\mathrm{d}\mathbf{x}/\mathrm{d}t = (\mathbf{x} - D)/t,

x(0)    x(t)+(0t)x(t)D(x(t);t)t  =  D(x(t);t).\mathbf{x}(0) \;\approx\; \mathbf{x}(t) + (0 - t)\cdot\frac{\mathbf{x}(t) - D(\mathbf{x}(t);t)}{t} \;=\; D(\mathbf{x}(t);\,t).

So with the σ(t)=t\sigma(t)=t schedule, the tangent direction points exactly to the current denoiser output. This is not true for other schedules: for VP, or for any choice with signal-rescaling s(t)1s(t) \ne 1, the tangent points to some weird intermediate target that depends on the schedule’s curvature. Empirically, the denoiser output D(x(t);t)D(\mathbf{x}(t);t), the model’s best guess of the clean image, changes very slowly along the sampling trajectory. Once you have an approximately right guess of the clean image, that guess remains approximately right as you anneal σ\sigma downward. So the tangent direction is roughly constant across steps, which is exactly the condition under which large Euler steps work well. This is the deepest answer to “why does DDIM-style ODE sampling take such big strides?”: the dynamics really want this schedule.
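Both claims (the tangent landing exactly on the denoiser output, and the near-straightness of the flow lines at high σ\sigma) can be checked in a toy Gaussian case where the optimal denoiser and the exact PF-ODE solution are both analytic. The uniform time grid and the helper names below are my own illustration:

```python
import math

S = 0.5  # sigma_data of a toy 1-D Gaussian data distribution x0 ~ N(0, S^2)

def D(x, t):
    """Analytic MMSE denoiser for the toy: E[x0 | x] at noise level t."""
    return x * S**2 / (S**2 + t**2)

def exact(xT, T, t):
    """Exact solution of the toy PF-ODE dx/dt = (x - D)/t = x*t/(S^2 + t^2)."""
    return xT * math.sqrt((S**2 + t**2) / (S**2 + T**2))

# (a) A single Euler step from t all the way down to 0 lands on D(x; t).
x, t = 2.0, 1.5
one_step = x + (0 - t) * (x - D(x, t)) / t
print(one_step, D(x, t))                 # identical up to rounding

# (b) In the near-linear high-sigma region, a handful of coarse Euler
#     steps already track the exact trajectory closely.
T, t_end, n, xT = 10.0, 1.25, 7, 4.0
ts = [T + i * (t_end - T) / n for i in range(n + 1)]
xc = xT
for ta, tb in zip(ts, ts[1:]):
    xc = xc + (tb - ta) * (xc - D(xc, ta)) / ta
rel_err = abs(xc - exact(xT, T, t_end)) / abs(exact(xT, T, t_end))
print(rel_err)                           # a few percent despite only 7 steps
```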

Preconditioning: deriving the EDM coefficients

The single most empirically impactful contribution of EDM is the explicit preconditioning of the network. Rather than asking a raw network FθF_\theta to output ϵ\boldsymbol{\epsilon} or x0\mathbf{x}_0 across many orders of magnitude of σ\sigma, Karras et al. write

Dθ(x;σ)  =  cskip(σ)x  +  cout(σ)Fθ(cin(σ)x,cnoise(σ)),D_\theta(\mathbf{x};\sigma) \;=\; c_{\text{skip}}(\sigma)\,\mathbf{x} \;+\; c_{\text{out}}(\sigma)\,F_\theta\bigl(c_{\text{in}}(\sigma)\,\mathbf{x},\,c_{\text{noise}}(\sigma)\bigr),

and derive the coefficients cskip,cout,cin,cnoisec_{\text{skip}}, c_{\text{out}}, c_{\text{in}}, c_{\text{noise}} analytically from two simple desiderata: the network’s input should have unit variance, and the network’s training target should have unit variance. Let me walk through the derivation.

Assume x0\mathbf{x}_0 has variance σdata2\sigma_{\text{data}}^2 per coordinate (typically σdata=0.5\sigma_{\text{data}} = 0.5 for normalised images). Then x=x0+σϵ\mathbf{x} = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon} has variance σdata2+σ2\sigma_{\text{data}}^2 + \sigma^2. For the network’s input cin(σ)xc_{\text{in}}(\sigma)\mathbf{x} to have unit variance,

cin(σ)  =  1σ2+σdata2.c_{\text{in}}(\sigma) \;=\; \frac{1}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}.

The training loss is E[λ(σ)Dθ(x;σ)x02]\mathbb{E}\bigl[\lambda(\sigma)\,\lVert D_\theta(\mathbf{x};\sigma) - \mathbf{x}_0\rVert^2\bigr]. Substituting the preconditioned form and rearranging, the network’s effective target is

Fθ(σ)  =  x0cskip(σ)xcout(σ).F_\theta^{\,\star}(\sigma) \;=\; \frac{\mathbf{x}_0 - c_{\text{skip}}(\sigma)\,\mathbf{x}}{c_{\text{out}}(\sigma)}.

We want Var(Fθ)=1\mathrm{Var}(F_\theta^{\,\star}) = 1. Expanding the numerator,

x0cskipx  =  x0cskip(x0+σϵ)  =  (1cskip)x0cskipσϵ,\mathbf{x}_0 - c_{\text{skip}}\mathbf{x} \;=\; \mathbf{x}_0 - c_{\text{skip}}(\mathbf{x}_0 + \sigma\boldsymbol{\epsilon}) \;=\; (1-c_{\text{skip}})\,\mathbf{x}_0 \,-\, c_{\text{skip}}\,\sigma\,\boldsymbol{\epsilon},

with variance (1cskip)2σdata2+cskip2σ2(1-c_{\text{skip}})^2 \sigma_{\text{data}}^2 + c_{\text{skip}}^2 \sigma^2. To minimise the magnitude of this target across σ\sigma (equivalently, to give the network the least extreme regression problem), we choose cskipc_{\text{skip}} to minimise this variance. Setting the derivative with respect to cskipc_{\text{skip}} to zero:

2(1cskip)σdata2+2cskipσ2=0        cskip(σ)  =  σdata2σ2+σdata2.-2(1-c_{\text{skip}})\sigma_{\text{data}}^2 + 2\,c_{\text{skip}}\,\sigma^2 = 0 \;\;\Longrightarrow\;\; c_{\text{skip}}(\sigma) \;=\; \frac{\sigma_{\text{data}}^2}{\sigma^2 + \sigma_{\text{data}}^2}.

Substituting back, the residual variance is

(1cskip)2σdata2+cskip2σ2  =  σ2σdata2σ2+σdata2.(1-c_{\text{skip}})^2 \sigma_{\text{data}}^2 + c_{\text{skip}}^2 \sigma^2 \;=\; \frac{\sigma^2\,\sigma_{\text{data}}^2}{\sigma^2 + \sigma_{\text{data}}^2}.

For unit variance of the network target we need cout2=σ2σdata2/(σ2+σdata2)c_{\text{out}}^2 = \sigma^2\sigma_{\text{data}}^2/(\sigma^2 + \sigma_{\text{data}}^2), hence

cout(σ)  =  σσdataσ2+σdata2.c_{\text{out}}(\sigma) \;=\; \frac{\sigma\cdot\sigma_{\text{data}}}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}}.

Finally, cnoisec_{\text{noise}} is the input embedding of the noise level, for which Karras et al. use the heuristic cnoise(σ)=14ln(σ)c_{\text{noise}}(\sigma) = \tfrac{1}{4}\ln(\sigma).
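These four coefficients are only a few lines of code, and both desiderata can be checked by Monte Carlo. The sketch below (a 1-D toy with σdata = 0.5; the function names are my own) draws pairs (x0, x) and confirms that the scaled input and the effective network target both come out with unit variance at every noise level:

```python
import math, random

S_DATA = 0.5

def c_in(sigma, s=S_DATA):   return 1.0 / math.sqrt(sigma**2 + s**2)
def c_skip(sigma, s=S_DATA): return s**2 / (sigma**2 + s**2)
def c_out(sigma, s=S_DATA):  return sigma * s / math.sqrt(sigma**2 + s**2)

def empirical_variances(sigma, s=S_DATA, n=100_000, seed=0):
    """Monte Carlo estimate of Var[c_in * x] and Var[(x0 - c_skip*x)/c_out]."""
    rng = random.Random(seed)
    v_in = v_tgt = 0.0
    for _ in range(n):
        x0 = rng.gauss(0.0, s)
        x = x0 + sigma * rng.gauss(0.0, 1.0)
        v_in += (c_in(sigma) * x) ** 2
        v_tgt += ((x0 - c_skip(sigma) * x) / c_out(sigma)) ** 2
    return v_in / n, v_tgt / n

checks = [empirical_variances(s_) for s_ in (0.01, 0.5, 10.0)]
for sigma, (v_in, v_tgt) in zip((0.01, 0.5, 10.0), checks):
    print(sigma, v_in, v_tgt)            # both close to 1.0 at every sigma
```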

Why this matters in practice: the two extremes. The behaviour of these coefficients at small and large σ\sigma explains why preconditioning is empirically transformative.

  • At small σ\sigma (clean image, σσdata\sigma \ll \sigma_{\text{data}}): cskip1c_{\text{skip}} \approx 1 and coutσc_{\text{out}} \approx \sigma. The architecture essentially copies the input through the skip, and the network’s output is a small correction scaled down by σ\sigma. If the network makes an error, that error is multiplied by a small number on its way out, downscaled, not amplified. The network is freed from having to learn the identity function (a real risk in naive setups) and can focus on producing a small denoising correction.
  • At large σ\sigma (mostly noise, σσdata\sigma \gg \sigma_{\text{data}}): cskip0c_{\text{skip}} \approx 0 and coutσdatac_{\text{out}} \approx \sigma_{\text{data}}. The skip is effectively disabled, and the network is asked to predict the clean signal directly. This avoids the disaster of a naive noise-prediction setup, where the network would predict ϵ-\boldsymbol{\epsilon} and have its output multiplied by σ\sigma, meaning a network error of order ee becomes an output error of order σe\sigma e, amplified by a potentially huge factor when σ\sigma is in the dozens.

The analytic cskipc_{\text{skip}} smoothly interpolates between these two regimes, automatically selecting “predict residual / predict signal” depending on how noisy the input is. The transition happens around σσdata\sigma \sim \sigma_{\text{data}}. The neural network always sees well-scaled inputs and produces well-scaled outputs; the brunt of the multi-scale problem is absorbed by an analytic skip connection.

A further upside the EDM authors emphasise: because the preconditioning analytically handles the scale variation, the architecture of the inner network FθF_\theta is now essentially free to be whatever you want (DDPM-style U-Net, NCSN-style, transformer, etc.) and the preconditioning works for “any” architecture. EDM cleanly separates “what the network is” from “how its inputs and outputs are scaled.”

The EDM sampler: Heun with stochastic churn

For sampling, EDM proposes a deterministic and a stochastic variant of the same core procedure. The deterministic version is Heun’s second-order ODE solver applied to the EDM ODE: for each step from tit_i to ti+1t_{i+1},

di  =  xiDθ(xi;ti)ti,x~i+1  =  xi+(ti+1ti)di,\mathbf{d}_i \;=\; \frac{\mathbf{x}_i - D_\theta(\mathbf{x}_i; t_i)}{t_i},\qquad \tilde{\mathbf{x}}_{i+1} \;=\; \mathbf{x}_i + (t_{i+1}-t_i)\,\mathbf{d}_i,
di+1  =  x~i+1Dθ(x~i+1;ti+1)ti+1,xi+1  =  xi+(ti+1ti)12(di+di+1).\mathbf{d}_{i+1}' \;=\; \frac{\tilde{\mathbf{x}}_{i+1} - D_\theta(\tilde{\mathbf{x}}_{i+1};t_{i+1})}{t_{i+1}},\qquad \mathbf{x}_{i+1} \;=\; \mathbf{x}_i + (t_{i+1}-t_i)\cdot\tfrac{1}{2}\bigl(\mathbf{d}_i + \mathbf{d}_{i+1}'\bigr).

Heun is a predictor–corrector scheme in the numerical-analysis sense: take a tentative Euler step, evaluate the slope at the predicted endpoint, then re-step using the average of the two slopes. The trapezoidal correction buys an extra order of accuracy over plain Euler (or first-order DDIM): the local truncation error is third-order rather than second-order in the step size, at the cost of one extra denoiser evaluation per step. Karras et al. tested various higher-order solvers (RK45, linear multistep methods) and reported that Heun strikes the best quality-per-NFE tradeoff.
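A minimal sketch of the deterministic Heun sampler, run against the analytic toy Gaussian denoiser so its accuracy can be compared head-to-head with plain Euler against the exact solution. The geometric σ\sigma-spacing and the helper names are my own simplification (EDM itself pairs Heun with the ρ\rho-schedule discussed below):

```python
import math

S = 0.5  # sigma_data of a toy Gaussian data distribution (stand-in for D_theta)

def D(x, t):
    return x * S**2 / (S**2 + t**2)      # analytic MMSE denoiser for the toy

def exact(xT, T, t):
    return xT * math.sqrt((S**2 + t**2) / (S**2 + T**2))

def heun_step(x, ta, tb):
    d1 = (x - D(x, ta)) / ta             # slope at the current point
    x_pred = x + (tb - ta) * d1          # Euler predictor
    if tb == 0:
        return x_pred                    # EDM falls back to Euler on the last step
    d2 = (x_pred - D(x_pred, tb)) / tb   # slope at the predicted endpoint
    return x + (tb - ta) * 0.5 * (d1 + d2)

T, t_end, n, xT = 10.0, 0.5, 7, 3.0
ts = [T * (t_end / T) ** (i / n) for i in range(n + 1)]   # geometric spacing
x_eul = x_heun = xT
for ta, tb in zip(ts, ts[1:]):
    x_eul = x_eul + (tb - ta) * (x_eul - D(x_eul, ta)) / ta
    x_heun = heun_step(x_heun, ta, tb)
ref = exact(xT, T, t_end)
print(abs(x_eul - ref), abs(x_heun - ref))   # Heun's error is noticeably smaller
```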

The stochastic variant adds Langevin-style churn: at the start of each step, inflate the current sample’s noise level by a small factor, then take a deterministic Heun step from the inflated level back down to ti+1t_{i+1}. Concretely:

γi=min ⁣(Schurn/N,21) if SmintiSmax, else 0,\gamma_i = \min\!\bigl(S_{\text{churn}}/N,\,\sqrt{2}-1\bigr) \text{ if } S_{\text{min}}\le t_i\le S_{\text{max}}, \text{ else } 0,
t^i=ti(1+γi),x^i=xi+t^i2ti2Snoiseϵ,\hat{t}_i = t_i(1+\gamma_i),\qquad \hat{\mathbf{x}}_i = \mathbf{x}_i + \sqrt{\hat{t}_i^2 - t_i^2}\cdot S_{\text{noise}}\cdot\boldsymbol{\epsilon},

followed by a Heun step from (x^i,t^i)(\hat{\mathbf{x}}_i, \hat{t}_i) to (xi+1,ti+1)(\mathbf{x}_{i+1}, t_{i+1}). The hyperparameters (Schurn,Smin,Smax,Snoise)(S_{\text{churn}}, S_{\text{min}}, S_{\text{max}}, S_{\text{noise}}) control how much, where in the schedule, and with what amplification noise is reinjected. The interpretation is exactly the predictor–corrector logic from Section 3 in a different disguise: the noise injection is the “Langevin exploration” that washes out deterministic drift error; the Heun step is the deterministic transport that follows the ODE flow lines. Putting them together at each step is empirically more controllable than running a full reverse-SDE solver.
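A single churn-plus-step iteration is compact enough to write out. The sketch below uses a constant γ\gamma and a plain Euler move after the noise injection (EDM itself uses Heun and only churns inside the [Smin,Smax][S_{\text{min}}, S_{\text{max}}] window); the function and parameter names are my own:

```python
import math, random

def churn_step(x, t_cur, t_next, denoise, gamma=0.05, s_noise=1.0, rng=random):
    """One stochastic sampling step: inflate the noise level from t_cur to
    t_hat, inject matching Gaussian noise, then take a deterministic Euler
    step along (x - D)/t down to t_next."""
    t_hat = t_cur * (1 + gamma)
    x_hat = x + math.sqrt(t_hat**2 - t_cur**2) * s_noise * rng.gauss(0.0, 1.0)
    d = (x_hat - denoise(x_hat, t_hat)) / t_hat
    return x_hat + (t_next - t_hat) * d

S = 0.5
def toy_D(x, t):                          # analytic denoiser for Gaussian data
    return x * S**2 / (S**2 + t**2)

# With gamma = 0 the step reduces exactly to deterministic Euler.
euler = 2.0 + (0.8 - 1.0) * (2.0 - toy_D(2.0, 1.0)) / 1.0
print(churn_step(2.0, 1.0, 0.8, toy_D, gamma=0.0), euler)

# A full stochastic chain from sigma = 10 down to 0.05 (geometric grid).
rng = random.Random(0)
ts = [10.0 * (0.05 / 10.0) ** (i / 12) for i in range(13)]
x = 8.0
for ta, tb in zip(ts, ts[1:]):
    x = churn_step(x, ta, tb, toy_D, rng=rng)
print(x)                                  # ends at data scale, far from x = 8.0
```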

The Karras σ\sigma-schedule: spend more steps near the data

The noise levels themselves are not linear or geometric but follow a power law parameterised by ρ\rho:

σi  =  (σmax1/ρ+iN1(σmin1/ρσmax1/ρ))ρ,i=0,,N1.\sigma_i \;=\; \Bigl(\sigma_{\max}^{1/\rho} + \tfrac{i}{N-1}\bigl(\sigma_{\min}^{1/\rho} - \sigma_{\max}^{1/\rho}\bigr)\Bigr)^{\rho},\qquad i = 0,\dots,N-1.

The defaults σmin=0.002\sigma_{\min} = 0.002, σmax=80\sigma_{\max} = 80, σdata=0.5\sigma_{\text{data}} = 0.5, and ρ=7\rho = 7 allocate disproportionately many steps to the low-σ\sigma end, where the sample is close to the denoised image. The motivation is twofold:

  1. Close to the data manifold, the ODE trajectory has its highest curvature. Far from the data the dynamics are gentle and approximately linear (recall that the EDM tangent points to the denoiser output, which barely changes). Close to the data the trajectory “locks onto” a specific sample, and the denoiser output begins to commit to fine details; this is where the flow lines turn. Linear Euler steps are most error-prone in regions of high curvature, so we want fine resolution there.
  2. Small errors near the data are perceptually catastrophic. An error of magnitude ee at small σ\sigma produces a visible artifact in the final sample (low σ\sigma means xt\mathbf{x}_t is almost the final image, and any deviation will appear in the output as-is). The same error at large σ\sigma gets washed out by subsequent denoising. Putting more steps where they matter gives much better quality per NFE.

Increasing ρ\rho makes the allocation more lopsided toward small σ\sigma. The default ρ=7\rho = 7 was found by ablation; “Karras sigmas” with this schedule are now the default in essentially every modern image-diffusion pipeline.
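The schedule itself is a one-liner. The following sketch (my own transcription of the formula above, with hypothetical names) makes the lopsidedness visible by printing the levels:

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels sigma_0 = sigma_max > ... > sigma_{n-1} = sigma_min,
    spaced linearly in sigma^(1/rho) and mapped back through the rho-th power."""
    lo, hi = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

s = karras_sigmas(10)
print([round(v, 4) for v in s])
# The first gap dwarfs the last: almost all the resolution sits near sigma_min.
print(s[0] - s[1], s[-2] - s[-1])
```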

Training: loss weighting and the noise level distribution

EDM’s training prescription is short but important enough that it deserves explicit treatment. Two choices complement the architectural preconditioning.

The noise level distribution during training is log-normal:

ln(σ)    N(Pmean,Pstd2),Pmean=1.2,  Pstd=1.2.\ln(\sigma) \;\sim\; \mathcal{N}(P_{\text{mean}},\,P_{\text{std}}^2),\qquad P_{\text{mean}} = -1.2,\;P_{\text{std}} = 1.2.

The motivation here is the mirror image of the sampling-schedule argument above. Empirically, the per-σ\sigma training loss has a roughly U-shaped profile: at very low σ\sigma (the input is almost clean) the denoising task is trivial (any reasonable network produces a near-identity mapping) and there is little gradient signal to learn from. At very high σ\sigma (the input is mostly noise) the denoising task is essentially hopeless (no signal to recover) and gradients are again uninformative. The sweet spot, where the network can actually make progress and gradients carry real information, sits in the middle of the σ\sigma range. The log-normal distribution above concentrates training samples in exactly this regime, putting computational effort where the model can actually learn something.
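Sampling training noise levels under this prescription is a two-liner; the quantile check below (my own illustration) shows how strongly the draws concentrate around exp(Pmean)0.30\exp(P_{\text{mean}}) \approx 0.30:

```python
import math, random

def sample_training_sigma(rng, p_mean=-1.2, p_std=1.2):
    """Draw one training noise level with ln(sigma) ~ N(p_mean, p_std^2)."""
    return math.exp(rng.gauss(p_mean, p_std))

rng = random.Random(0)
draws = sorted(sample_training_sigma(rng) for _ in range(100_000))
quantile = lambda p: draws[int(p * len(draws))]
print(quantile(0.25), quantile(0.50), quantile(0.75))
# The central half of the draws falls roughly between 0.13 and 0.68,
# far from both extremes of the sampling schedule (0.002 and 80).
```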

Note that the sampling and training schedules answer different questions. Sampling allocates more steps near small σ\sigma because that is where truncation error is most damaging. Training allocates more samples near intermediate σ\sigma because that is where gradient information is richest. The two need not coincide, and in EDM they don’t.

The loss weighting that complements this is

λ(σ)  =  σ2+σdata2(σσdata)2.\lambda(\sigma) \;=\; \frac{\sigma^2 + \sigma_{\text{data}}^2}{(\sigma\cdot\sigma_{\text{data}})^2}.

This is chosen to equalise the gradient magnitude across noise levels. Without proper weighting, a naive Fθtarget2\lVert F_\theta – \text{target}\rVert^2 loss has gradients that vary by orders of magnitude across σ\sigma, leading to highly imbalanced training: most updates push the network gently while occasional high-magnitude updates dominate. The weighting λ(σ)\lambda(\sigma), combined with the unit-variance preconditioning, produces gradient updates of comparable magnitude at every σ\sigma, which empirically gives much more stable training dynamics.
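There is a tidy identity hiding here: λ(σ)\lambda(\sigma) is exactly 1/cout(σ)21/c_{\text{out}}(\sigma)^2, so the weighted loss on DθD_\theta is the unweighted, unit-variance loss on the inner network FθF_\theta. A quick numeric check (toy helper names, σdata = 0.5):

```python
S_DATA = 0.5

def c_out_sq(sigma, s=S_DATA):
    return sigma**2 * s**2 / (sigma**2 + s**2)

def loss_weight(sigma, s=S_DATA):
    return (sigma**2 + s**2) / (sigma * s) ** 2

# lambda(sigma) * c_out(sigma)^2 == 1 at every noise level: the weighting
# exactly undoes the output scaling, leaving a unit-weight loss on F_theta.
products = [loss_weight(s_) * c_out_sq(s_) for s_ in (0.002, 0.1, 0.5, 10.0, 80.0)]
print(products)
```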

An important observation on stochasticity, paraphrased from the EDM authors’ talk. The stochastic-churn sampler is empirically helpful, but it is a double-edged sword. The Langevin churn introduces its own discretisation error on top of the deterministic Heun step, so the net benefit depends on a delicate balance, and the optimal churn hyperparameters (Schurn,Smin,Smax,Snoise)(S_{\text{churn}}, S_{\text{min}}, S_{\text{max}}, S_{\text{noise}}) need to be tuned per dataset and per architecture, with no clean principles for how to do this. Worse, churn can mask bugs and modelling errors: a model with a slight bias in its denoiser can still produce reasonable samples under enough exploration, even when its deterministic samples reveal that something is wrong.

The authors’ recommendation, which I have come to follow in my own protein-design work, is: develop and debug in fully deterministic mode (Heun on the PF-ODE), and add stochasticity as “the final cherry on top” only after you are confident the deterministic samples look right. As a calibration point, the EDM CIFAR-10 model benefits negligibly from churn at its trained quality (the network is good enough that any exploration injects more error than it fixes), while their ImageNet-64 model still benefits, a useful empirical pointer that the marginal value of churn diminishes as the score network gets better.

EDM in the wild: the RFDiffusion3 inference sampler

To make the relevance of all this concrete, let us look at how the state-of-the-art protein-structure generator RFDiffusion3 implements its sampler. The relevant code lives in inference_sampler.py in the RosettaCommons foundry repository, in a method tellingly named sample_diffusion_like_af3. Stripping out the protein-specific bookkeeping (motif handling, classifier-free guidance, recycling, logging), the inner loop is essentially this:

for step_num, (c_t_minus_1, c_t) in enumerate(zip(noise_schedule, noise_schedule[1:])):

    # (1) Stochastic churn: inflate the current noise level by (1 + gamma).
    gamma = self.gamma_0 if c_t > self.gamma_min else 0
    t_hat = c_t_minus_1 * (gamma + 1)

    # (2) Inject noise to match the inflated level: epsilon ~ sqrt(t_hat^2 - t^2) * z.
    epsilon_L = (
        self.noise_scale
        * torch.sqrt(torch.square(t_hat) - torch.square(c_t_minus_1))
        * torch.normal(mean=0.0, std=1.0, size=X_L.shape, device=X_L.device)
    )
    X_noisy_L = X_L + epsilon_L

    # (3) Call the preconditioned denoiser D_theta(x; t_hat) on the inflated state.
    X_denoised_L = diffusion_module(X_noisy_L=X_noisy_L, t=t_hat.tile(D), ...)

    # (4) EDM tangent direction: (x - D(x; t_hat)) / t_hat.
    delta_L = (X_noisy_L - X_denoised_L) / t_hat   # "gradient of x wrt. t at x_t_hat"
    d_t = c_t - t_hat

    # (5) First-order Euler step along the EDM ODE.
    X_L = X_noisy_L + step_scale * d_t * delta_L

Each of these five lines is a line-for-line transcription of an equation we have just derived.

  • Lines (1) and (2) are EDM’s stochastic churn: t^i=ti(1+γi)\hat{t}_i = t_i(1+\gamma_i) followed by x^i=xi+t^i2ti2Snoiseϵ\hat{\mathbf{x}}_i = \mathbf{x}_i + \sqrt{\hat{t}_i^2 - t_i^2}\cdot S_{\text{noise}}\cdot\boldsymbol{\epsilon}. The thresholding γ=γ0\gamma = \gamma_0 if ct>γminc_t > \gamma_\text{min} else 00 is the EDM paper’s S_min/S_max window (implemented here with only the lower cutoff), deciding where in the schedule to spend the churn budget.
  • Line (3) is the call to the preconditioned denoiser Dθ(x;t^)D_\theta(\mathbf{x}; \hat{t}). The diffusion_module here wraps the inner network with the analytic skip-connection and input/output scalings we derived above.
  • Line (4) is literally the EDM ODE: dx/dt=(xD(x;t))/t\mathrm{d}\mathbf{x}/\mathrm{d}t = (\mathbf{x} – D(\mathbf{x}; t))/t. The variable is even named delta_L and the comment in the source code reads “gradient of x wrt. t at x_t_hat”.
  • Line (5) is the Euler step xi+1=x^i+(ti+1t^i)di\mathbf{x}_{i+1} = \hat{\mathbf{x}}_i + (t_{i+1} - \hat{t}_i)\,\mathbf{d}_i, with an extra step_scale knob that lets the user dial the step magnitude up or down (useful for trading sample diversity against fidelity in protein design).

Two small caveats worth flagging. First, RFDiffusion3 uses a first-order Euler integrator rather than the second-order Heun method recommended in the EDM paper; in practice the protein community has found that the extra denoiser evaluation per step is not worth its cost when the network is already expensive. Second, the step_scale and noise_scale multipliers are protein-design-specific knobs grafted on top of the pure EDM recipe to control downstream metrics like RMSD-to-motif and sample diversity, the kind of practical adjustment one always finds when transplanting a recipe from images to structures.

The point is that essentially every line of the inner sampling loop of a modern de-novo protein-design model is exactly the Karras et al. EDM sampler from Section 4, with the same churn, the same (xD)/t(\mathbf{x} - D)/t tangent direction, and the same preconditioned denoiser at its heart. Everything we have built up in this section is not retrospective taxonomy: it is the code path running every time someone designs a new protein with RFDiffusion3.

5. The Variance Schedule, Up Close

Because every method we have discussed is, mathematically, a particular Gaussian probability path xt=αtx0+σtϵ\mathbf{x}_t = \alpha_t\,\mathbf{x}_0 + \sigma_t\,\boldsymbol{\epsilon}, it is worth pausing to lay out side by side the choices of (αt,σt)(\alpha_t, \sigma_t) that constitute the variance schedule in each framework.

Framework | αt\alpha_t | σt\sigma_t | αt2+σt2\alpha_t^2 + \sigma_t^2 (when Var(x0)=1\mathrm{Var}(\mathbf{x}_0) = 1) | Boundary at tTt \to T
DDPM / VP, linear | αˉt\sqrt{\bar{\alpha}_t}, βt[104,2 ⁣ ⁣102]\beta_t\in[10^{-4},2\!\cdot\!10^{-2}] linear | 1αˉt\sqrt{1-\bar{\alpha}_t} | =1= 1 (preserved) | N(0,I)\to\mathcal{N}(\mathbf{0},\mathbf{I})
DDPM / VP, cosine | αˉt=cos ⁣(t/T+s1+sπ2) ⁣/ ⁣cos ⁣(s1+sπ2)\sqrt{\bar{\alpha}_t} = \cos\!\bigl(\tfrac{t/T+s}{1+s}\cdot\tfrac{\pi}{2}\bigr)\!/\!\cos\!\bigl(\tfrac{s}{1+s}\cdot\tfrac{\pi}{2}\bigr) | 1αˉt\sqrt{1-\bar{\alpha}_t} | =1= 1 (preserved) | N(0,I)\to\mathcal{N}(\mathbf{0},\mathbf{I})
NCSN / VE | 11 | σ(t)\sigma(t), geometric or polynomial | 1+σt21 + \sigma_t^2 (exploding) | N(x0,σmax2I)\to\mathcal{N}(\mathbf{x}_0,\sigma_{\max}^2\mathbf{I})
sub-VP | αˉt\sqrt{\bar{\alpha}_t} | 1αˉt1-\bar{\alpha}_t (note: no square root) | <1< 1 at intermediate tt | N(0,I)\to\mathcal{N}(\mathbf{0},\mathbf{I})
EDM | 11 | σ[0.002,80]\sigma\in[0.002,80] via ρ=7\rho=7 power law | 1+σ21 + \sigma^2 (VE-like) | truncated at σmax\sigma_{\max}, not \infty

A few comments worth internalising. Linear vs cosine is a small but consequential change: Nichol & Dhariwal (2021) noted that the linear schedule destroys the signal far too aggressively in the early-to-middle timesteps for low-resolution images, leaving the network nothing to learn during the late part of training; the cosine schedule keeps αˉt\bar{\alpha}_t near 1 for longer and then collapses smoothly. VP vs VE is a more fundamental geometric choice: under VP the trajectory stays inside a unit ball, under VE it expands outward. EDM is essentially VE plus principled preconditioning, which restores numerical stability and lets the network see well-scaled inputs at every σ\sigma. Finally, sub-VP sacrifices a small amount of sample quality for a sharper concentration around x0\mathbf{x}_0 at intermediate tt, improving likelihood.

6. Connections, Tradeoffs, and Where the Field Sits

It is now possible to draw the picture cleanly. Every method we have discussed corresponds to a choice of three things:

  1. A probability path {pt}\{p_t\} interpolating noise and data, defined by (αt,σt)(\alpha_t, \sigma_t). VP, VE, sub-VP, and EDM-VE are different paths.
  2. A regression target (ϵ\boldsymbol{\epsilon}, x0\mathbf{x}_0, or score logpt\nabla\log p_t), all of which are related by exact closed-form linear transformations once the path and prior are fixed.
  3. A sampler, either an SDE solver (DDPM ancestral, reverse-SDE Euler–Maruyama, EDM-stochastic), a deterministic ODE solver (DDIM, Heun, RK4), or a hybrid (predictor–corrector, EDM-churn).

The historical narrative is essentially that successive papers separated these three choices and showed that mixing them productively gives better samples. DDPM coupled VP, ϵ\epsilon-prediction, and ancestral SDE sampling. DDIM kept VP and ϵ\epsilon but swapped the sampler for an ODE. Song et al. (2021b) showed all of these are discretisations of a continuous SDE and that the same model admits both stochastic and deterministic samplers, and introduced VE and sub-VP. EDM then showed that with the right preconditioning and ODE solver, the difference between VP and VE essentially vanishes and most empirical gains come from sampler quality and noise-level distribution at training.

The stochastic vs deterministic tradeoff is worth spelling out. Stochastic samplers (DDPM, reverse-SDE, EDM with churn) act as their own error correctors: noise injection at intermediate steps washes out accumulated drift error from the score network, often producing better FID at high NFE than the corresponding deterministic ODE solver on the same model. Deterministic samplers (DDIM, Heun on the PF-ODE) are faster per quality at low NFE, are invertible (useful for editing, latent interpolation, likelihood evaluation), and are reproducible. Karras et al. (2022) systematically characterise this crossover; in my own protein-design work I tend to use the deterministic Heun sampler for NFE budgets under ~30 and switch to stochastic-churn variants when I have NFE to spare or am worried about mode coverage.

Comparison of sampling methods

Method | Year | Key idea | Sampling type | Pros | When to use
DDPM (Ho et al.) | 2020 | Reverse Markov chain, ϵ\epsilon-prediction, linear β\beta | Stochastic (ancestral) | Simple, foundational, robust training | Reference implementations; teaching
DDIM (Song et al.) | 2021 | Non-Markovian forward, η=0\eta=0 deterministic | Deterministic (or interpolated) | 10 to 50 times fewer steps than DDPM, invertible, latent interpolation | Fast inference with an existing DDPM-trained model
NCSN / VE-SDE (Song & Ermon; Song et al.) | 2019 / 2021 | Score matching at multiple noise scales; Langevin sampling | Stochastic (Langevin / SDE) | Strong likelihoods; PF-ODE for exact density | High-noise regimes; likelihood-critical applications
VP-SDE / sub-VP (Song et al.) | 2021 | Continuous-time DDPM via SDE; PC sampler; PF-ODE | Both | Unified framework; principled likelihood; PC samplers | When you want the SDE formalism (e.g. inverse problems, guidance)
EDM (Karras et al.) | 2022 | σ\sigma-parameterisation, preconditioning, Heun + churn | Both | Best FID per NFE; clean code; modular | Modern default for image / structure diffusion; my default in protein work

The current state of the art in protein-structure generation is, at the time of writing in May 2026, a mixture of these ingredients. Backbone-generation models like RFDiffusion descend directly from the score-SDE / EDM lineage. The point I would press on a student starting the field now is that the training objectives across all of these are nearly identical regression losses on linear reparameterisations of the same quantity (noise, clean data, or score), and what really differs between methods is the path, the sampler, and the symmetry group baked into the network. Pick the path that gives the geometry you want; pick the sampler that matches your NFE budget; train the same kind of MSE in every case.

Where to look next

The story above stops short of an important and very active branch of the field: flow matching and its relatives (rectified flow, OT-CFM). These methods drop the diffusion construction entirely and learn the velocity field of an ODE that transports noise to data along arbitrary, often straight-line, probability paths. They have become dominant in protein-structure generation since roughly 2023 and deserve their own dedicated treatment. For a broader and excellent overview of this family I recommend the MIT 6.S979 lecture notes on diffusion and flow-based generative models, which cover flow matching, conditional flow matching, and the connections back to the score-based picture we have built up here.

References

  1. Anderson, B. D. O. (1982). Reverse-time diffusion equation models. Stochastic Processes and Their Applications 12(3), 313–326.
  2. Feller, W. (1949). On the theory of stochastic processes, with particular reference to applications. Proceedings of the (First) Berkeley Symposium on Mathematical Statistics and Probability, 403–432.
  3. Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239.
  4. Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6, 695–709.
  5. Karras, T., Aittala, M., Aila, T., & Laine, S. (2022). Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS 2022. arXiv:2206.00364.
  6. MIT 6.S979 (2026). Lecture notes on diffusion and flow-based generative models.
  7. Nichol, A., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021. arXiv:2102.09672.
  8. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015. arXiv:1503.03585.
  9. Song, J., Meng, C., & Ermon, S. (2021a). Denoising Diffusion Implicit Models. ICLR 2021. arXiv:2010.02502.
  10. Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019. arXiv:1907.05600.
  11. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021b). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021. arXiv:2011.13456.
  12. Vincent, P. (2011). A Connection Between Score Matching and Denoising Autoencoders. Neural Computation 23(7), 1661–1674.
  13. Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., et al. (2023). De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100.
