{"id":14282,"date":"2026-05-14T18:15:05","date_gmt":"2026-05-14T17:15:05","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=14282"},"modified":"2026-05-14T18:15:12","modified_gmt":"2026-05-14T17:15:12","slug":"a-timeline-of-sampling-methods-of-diffusion-models","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2026\/05\/a-timeline-of-sampling-methods-of-diffusion-models\/","title":{"rendered":"A timeline of sampling methods of diffusion models"},"content":{"rendered":"\n<p class=\"\">When approaching the methods used in de-novo protein design, one is quickly confronted with a plethora of overlapping formulations of what looks superficially like &#8220;the same thing&#8221;. One paper trains an <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>-prediction network with a <em>simple<\/em> MSE loss; another trains a score network with a stochastic-differential-equation justification; a third trains a clean-data predictor under yet another schedule. Each formulation carries its own notation, its own variance schedule, and its own sampler. Qualitatively, this zoo of formulations is doing the same thing: it starts from some unstructured noise and iteratively refines it to eventually produce a protein structure similar (but different!) to other proteins we have experimentally determined in the past. What is not immediately obvious to a newcomer is that all of these formulations are historical descendants of a small number of foundational ideas, and that essentially every architectural and algorithmic decision in a modern protein-design diffusion model has a specific paper of origin and a specific motivation for being there.<\/p>\n\n\n\n<p class=\"\">This post is my attempt to put these formulations onto a single timeline. I trace the trajectory of the field through four foundational works: DDPM (<a href=\"#ref-ho2020\">Ho et al., 2020<\/a>), DDIM (<a href=\"#ref-song2021a\">Song et al., 2021a<\/a>), the score-based SDE unification (<a href=\"#ref-song2021b\">Song et al., 2021b<\/a>), and EDM (Karras et al., 2022), explaining at each step <em>what specific problem with the previous formulation the next paper was attacking<\/em> and <em>how the new formulation generalises or simplifies the old one<\/em>. The goal is coherent motivation rather than exhaustive coverage; the reader interested in implementation details is referred to the original papers and the references at the end.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The problem we want to solve<\/h3>\n\n\n\n<p class=\"\">Before diving into any specific method, let us be precise about the problem all of them are trying to solve. As mentioned above, we have a <em>target distribution<\/em> <math><semantics><mrow><msub><mi>p<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{\\text{data}}<\/annotation><\/semantics><\/math> on <math><semantics><mrow><msup><mi mathvariant=\"double-struck\">R<\/mi><mi>d<\/mi><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbb{R}^d<\/annotation><\/semantics><\/math> we are interested in sampling from. For our purposes, think the distribution over plausible protein backbones, side-chain configurations, or full structures. 
We cannot evaluate <math><semantics><mrow><msub><mi>p<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{\\text{data}}<\/annotation><\/semantics><\/math> analytically and we cannot sample from it directly; what we have is a finite dataset of samples <math><semantics><mrow><mo stretchy=\"false\">{<\/mo><msup><mi mathvariant=\"bold\">x<\/mi><mrow><mo stretchy=\"false\">(<\/mo><mi>i<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><\/msup><msubsup><mo stretchy=\"false\">}<\/mo><mrow><mi>i<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><mi>N<\/mi><\/msubsup><mo>\u223c<\/mo><msub><mi>p<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\{\\mathbf{x}^{(i)}\\}_{i=1}^{N} \\sim p_{\\text{data}}<\/annotation><\/semantics><\/math>. We want a generative model that produces fresh samples from <math><semantics><mrow><msub><mi>p<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{\\text{data}}<\/annotation><\/semantics><\/math> at inference time.<\/p>\n\n\n\n<p class=\"\">The strategy that diffusion and score-based models all share is the same: introduce a <em>tractable prior<\/em> <math><semantics><mrow><msub><mi>p<\/mi><mtext>prior<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{\\text{prior}}<\/annotation><\/semantics><\/math> (typically an isotropic Gaussian <math><semantics><mrow><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{N}(\\mathbf{0}, \\mathbf{I})<\/annotation><\/semantics><\/math>) that we can sample from trivially, and learn a <em>map<\/em> that transforms samples from <math><semantics><mrow><msub><mi>p<\/mi><mtext>prior<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{\\text{prior}}<\/annotation><\/semantics><\/math> into samples from <math><semantics><mrow><msub><mi>p<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{\\text{data}}<\/annotation><\/semantics><\/math>. The methods differ in how they construct this map: as the reverse of a noising Markov chain (DDPM), or as the time-integral of a reverse-time stochastic differential equation (score SDE), with the deterministic probability-flow ODE as a natural sibling. 
But the goal is identical, and the underlying object (a parameterised, time-dependent transport from a tractable prior to an intractable data distribution) is identical across all of them.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?ssl=1\"><img decoding=\"async\" width=\"2560\" height=\"493\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?fit=625%2C120&amp;ssl=1\" alt=\"\" class=\"wp-image-14285\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?w=2560&amp;ssl=1 2560w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?resize=300%2C58&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?resize=1024%2C197&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?resize=768%2C148&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?resize=1536%2C296&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?resize=2048%2C395&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?resize=624%2C120&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Sampling_diffusion_example-scaled.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p class=\"\">A useful equation to keep in mind throughout is the <em>Gaussian probability path<\/em><\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t \\;=\\; \\alpha_t\\,\\mathbf{x}_0 \\;+\\; \\sigma_t\\,\\boldsymbol{\\epsilon},\\qquad \\boldsymbol{\\epsilon}\\sim\\mathcal{N}(\\mathbf{0},\\mathbf{I}),<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">shared by all the methods we will discuss. 
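<\/p>\n\n\n\n<p class=\"\">To make this concrete, here is a minimal NumPy sketch of drawing a noisy sample from the path for a given schedule pair; the two instantiations anticipate the variance-preserving and variance-exploding choices discussed below (the function and variable names are mine, not from any of the papers):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np

rng = np.random.default_rng(0)

def sample_path(x0, alpha_t, sigma_t):
    # x_t = alpha_t * x0 + sigma_t * eps,  with eps ~ N(0, I)
    eps = rng.standard_normal(x0.shape)
    return alpha_t * x0 + sigma_t * eps

x0 = rng.standard_normal(3)                # a toy 'clean' data point

# variance-preserving choice (DDPM-style): alpha_t**2 + sigma_t**2 = 1
xt_vp = sample_path(x0, np.sqrt(0.5), np.sqrt(1 - 0.5))

# variance-exploding choice (VE-style): alpha_t = 1, sigma_t grows with t
xt_ve = sample_path(x0, 1.0, 5.0)<\/code><\/pre>\n\n\n\n<p class=\"\">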
The differences between DDPM, DDIM, VP\/VE-SDE, and EDM can largely be read off as different choices of <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\alpha_t, \sigma_t)<\/annotation><\/semantics><\/math> and different ways of turning a learnt regression target back into a sample.<\/p>\n\n\n\n<div class=\"h-16\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. DDPM: Denoising as a Hierarchical VAE<\/h2>\n\n\n\n<p class=\"\">The modern diffusion era begins with <strong><a href=\"#ref-ho2020\">Ho, Jain &amp; Abbeel&#8217;s &#8220;Denoising Diffusion Probabilistic Models&#8221;<\/a><\/strong> (<a href=\"#ref-ho2020\">Ho et al., 2020<\/a>), which built on the variational construction of <a href=\"#ref-sohldickstein2015\">Sohl-Dickstein et al. (2015)<\/a> and simplified it into a recipe that could be trained reliably at the scale of CIFAR-10 and LSUN. The contribution is conceptually two-part: a clever choice of variational objective that turns generative modelling into a denoising problem, and an empirical finding that a <em>simplified<\/em> version of that objective trains better than the principled one.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The forward (noising) process<\/h3>\n\n\n\n<p class=\"\">DDPM defines a fixed, parameter-free Markov chain that gradually corrupts a data point <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>\u223c<\/mo><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\mathbf{x}_0 \sim q(\mathbf{x}_0)<\/annotation><\/semantics><\/math> into Gaussian noise over <math><semantics><mrow><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">T<\/annotation><\/semantics><\/math> discrete steps. 
With a <em>variance schedule<\/em> <math><semantics><mrow><mo stretchy=\"false\">{<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><msubsup><mo stretchy=\"false\">}<\/mo><mrow><mi>t<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><mi>T<\/mi><\/msubsup><mo>\u2282<\/mo><mo stretchy=\"false\">(<\/mo><mn>0<\/mn><mo separator=\"true\">,<\/mo><mn>1<\/mn><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\{\\beta_t\\}_{t=1}^{T}\\subset(0,1)<\/annotation><\/semantics><\/math>,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"script\">N<\/mi><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">;<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mi mathvariant=\"bold\">I<\/mi><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_t \\mid \\mathbf{x}_{t-1}) \\;=\\; \\mathcal{N}\\!\\bigl(\\mathbf{x}_t;\\sqrt{1-\\beta_t}\\,\\mathbf{x}_{t-1},\\,\\beta_t\\mathbf{I}\\bigr).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">In the standard DDPM the schedule is <strong>linear<\/strong>: <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t<\/annotation><\/semantics><\/math> grows linearly from <math><semantics><mrow><msub><mi>\u03b2<\/mi><mn>1<\/mn><\/msub><mo>=<\/mo><msup><mn>10<\/mn><mrow><mo>\u2212<\/mo><mn>4<\/mn><\/mrow><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_1 = 10^{-4}<\/annotation><\/semantics><\/math> to <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>T<\/mi><\/msub><mo>=<\/mo><mn>2<\/mn><mo>\u00d7<\/mo><msup><mn>10<\/mn><mrow><mo>\u2212<\/mo><mn>2<\/mn><\/mrow><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_T = 2\\times 10^{-2}<\/annotation><\/semantics><\/math> over <math><semantics><mrow><mi>T<\/mi><mo>=<\/mo><mn>1000<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">T = 1000<\/annotation><\/semantics><\/math> steps (<a href=\"#ref-ho2020\">Ho et al., 2020<\/a>).<\/p>\n\n\n\n<p class=\"\">It is worth pausing to understand what this conditional is doing. 
The new sample <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> is drawn from a Gaussian centred at <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\beta_t}\\,\\mathbf{x}_{t-1}<\/annotation><\/semantics><\/math> with variance <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mi mathvariant=\"bold\">I<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t \\mathbf{I}<\/annotation><\/semantics><\/math>. Two observations are crucial here:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\"><br><p><strong>The mean of <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> equals the mean of <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1}<\/annotation><\/semantics><\/math> multiplied by <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mo>&lt;<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\beta_t} &lt; 1<\/annotation><\/semantics><\/math>.<\/strong> Iterating this forward, the mean of <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> conditional on <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> is <math><semantics><mrow><msub><mo>\u220f<\/mo><mrow><mi>s<\/mi><mo>\u2264<\/mo><mi>t<\/mi><\/mrow><\/msub><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>s<\/mi><\/msub><\/mrow><\/msqrt><mo>\u22c5<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\prod_{s\\le t} \\sqrt{1-\\beta_s}\\cdot\\mathbf{x}_0<\/annotation><\/semantics><\/math>, a product of factors strictly less than one. Such a product converges to zero as <math><semantics><mrow><mi>t<\/mi><mo>\u2192<\/mo><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t\\to T<\/annotation><\/semantics><\/math>. 
In words: the forward process gradually <em>forgets<\/em> the starting point and pushes the conditional mean toward the origin.<\/p><br><\/li>\n\n\n\n<li class=\"\"><br><p><strong>The variance accumulates from zero towards one.<\/strong> A short recursive calculation (below) shows that if <math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo>\u2264<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_0) \\le 1<\/annotation><\/semantics><\/math>, the conditional variance of <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t \\mid \\mathbf{x}_0<\/annotation><\/semantics><\/math> grows monotonically from <math><semantics><mrow><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">0<\/annotation><\/semantics><\/math> at <math><semantics><mrow><mi>t<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t=0<\/annotation><\/semantics><\/math> to a value close to <math><semantics><mrow><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">1<\/annotation><\/semantics><\/math> at <math><semantics><mrow><mi>t<\/mi><mo>=<\/mo><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t = T<\/annotation><\/semantics><\/math>.<\/p><br><\/li>\n<\/ol>\n\n\n\n<p class=\"\">Putting these together: regardless of where <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> lives, <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_T \\mid \\mathbf{x}_0)<\/annotation><\/semantics><\/math> is approximately <math><semantics><mrow><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{N}(\\mathbf{0}, \\mathbf{I})<\/annotation><\/semantics><\/math> for any reasonable <math><semantics><mrow><mi>\u03b2<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\beta<\/annotation><\/semantics><\/math> schedule. This is the property that makes the whole construction work.<\/p>\n\n\n\n<p class=\"\">The crucial algebraic fact about Gaussian Markov chains, and what makes diffusion tractable in the first place, is that the marginal at <em>any<\/em> step <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> conditioned on <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> admits a closed form. 
Setting <math><semantics><mrow><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\alpha_t = 1-\\beta_t<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><mo>=<\/mo><msubsup><mo>\u220f<\/mo><mrow><mi>s<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><mi>t<\/mi><\/msubsup><msub><mi>\u03b1<\/mi><mi>s<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}_t = \\prod_{s=1}^{t}\\alpha_s<\/annotation><\/semantics><\/math>,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"script\">N<\/mi><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">;<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"bold\">I<\/mi><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mtext>\u2005\u200a<\/mtext><mtext>\u2005\u200a<\/mtext><mo>\u27fa<\/mo><mtext>\u2005\u200a<\/mtext><mtext>\u2005\u200a<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo separator=\"true\">,<\/mo><mspace width=\"1em\"><\/mspace><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_t \\mid \\mathbf{x}_0) \\;=\\; \\mathcal{N}\\!\\bigl(\\mathbf{x}_t;\\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0,\\,(1-\\bar{\\alpha}_t)\\mathbf{I}\\bigr)\n\\;\\;\\Longleftrightarrow\\;\\;\n\\mathbf{x}_t \\;=\\; \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0 \\;+\\; \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon},\\quad \\boldsymbol{\\epsilon}\\sim\\mathcal{N}(\\mathbf{0},\\mathbf{I}).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This is the <em>reparameterisation<\/em>: rather than simulating <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> forward steps, we can sample 
<math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> directly from <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> in one go, using a single Gaussian noise draw and a precomputed <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}_t<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>Intuition:<\/em> The forward process gradually destroys information by adding noise. The signal <math><semantics><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0<\/annotation><\/semantics><\/math> is rescaled <em>down<\/em> as <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> grows, while the noise <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> is rescaled <em>up<\/em>; together they keep the total variance at unity (when <math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_0) = 1<\/annotation><\/semantics><\/math>). This is what we will later call a <em>variance-preserving<\/em> path.<\/p>\n<p>It is worth pausing on how nontrivial this is. <strong>No matter where you start<\/strong>, a protein structure, an image of a cat, a noisy sketch, applying small Gaussian perturbations enough times always lands you at the same easy-to-sample distribution: an isotropic Gaussian. The forward direction is universal. The generative problem reduces to <strong>inverting this &#8220;universal contraction&#8221;<\/strong>: if we can learn to undo the corruption step-by-step, we can sample new data points by starting at the Gaussian prior and reversing the chain.<\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">A latent-variable view, and why we need a variational bound<\/h3>\n\n\n\n<p class=\"\">Suppose we want to use the reverse direction to draw samples. 
We define a generative model <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta<\/annotation><\/semantics><\/math> that parameterises a <em>reverse<\/em> Markov chain,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>0<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><munderover><mo>\u220f<\/mo><mrow><mi>t<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><mi>T<\/mi><\/munderover><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_{0:T}) \\;=\\; p(\\mathbf{x}_T)\\,\\prod_{t=1}^{T} p_\\theta(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t),<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">where the prior <math><semantics><mrow><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p(\\mathbf{x}_T) = \\mathcal{N}(\\mathbf{0},\\mathbf{I})<\/annotation><\/semantics><\/math> matches the limiting marginal of the forward process, and each reverse transition is a Gaussian with parameters <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">\u03a3<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\boldsymbol{\\mu}_\\theta(\\mathbf{x}_t,t), \\boldsymbol{\\Sigma}_\\theta(\\mathbf{x}_t,t))<\/annotation><\/semantics><\/math> predicted by a neural network. To train, we want to maximise the likelihood <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_0)<\/annotation><\/semantics><\/math> of the observed data. Why the <em>joint<\/em> and not just <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_0)<\/annotation><\/semantics><\/math> directly? 
Because we cannot evaluate <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_0)<\/annotation><\/semantics><\/math> in closed form: it is the marginal of the joint over all intermediate states:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u222b<\/mo><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>0<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_0) \\;=\\; \\int p_\\theta(\\mathbf{x}_{0:T})\\,\\mathrm{d}\\mathbf{x}_{1:T}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The integral on the right is over <math><semantics><mrow><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">T<\/annotation><\/semantics><\/math> high-dimensional latent variables and is intractable. This is the standard situation in latent-variable models, and the standard fix, due to the VAE literature, is to lower-bound the log-likelihood with the <strong>evidence lower bound (ELBO)<\/strong>. We introduce a <em>variational distribution<\/em> <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_{1:T}\\mid\\mathbf{x}_0)<\/annotation><\/semantics><\/math> over the latents, multiply and divide by it inside the integral, and apply Jensen&#8217;s inequality to the resulting log of an expectation:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi>log<\/mi><mo>\u2061<\/mo><mo>\u222b<\/mo><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mfrac><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>0<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><msub><mi 
mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>\u2265<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi mathvariant=\"double-struck\">E<\/mi><mi>q<\/mi><\/msub><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">[<\/mo><mi>log<\/mi><mo>\u2061<\/mo><mfrac><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>0<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mo fence=\"true\">]<\/mo><\/mrow><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\log p_\\theta(\\mathbf{x}_0)\n\\;=\\; \\log \\int q(\\mathbf{x}_{1:T}\\mid\\mathbf{x}_0)\\,\\frac{p_\\theta(\\mathbf{x}_{0:T})}{q(\\mathbf{x}_{1:T}\\mid\\mathbf{x}_0)}\\,\\mathrm{d}\\mathbf{x}_{1:T}\n\\;\\ge\\; \\mathbb{E}_{q}\\!\\left[\\log\\frac{p_\\theta(\\mathbf{x}_{0:T})}{q(\\mathbf{x}_{1:T}\\mid\\mathbf{x}_0)}\\right].<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The expectation on the right is the <em>evidence lower bound<\/em>. Its negative is the <strong>variational lower bound loss<\/strong>,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"script\">L<\/mi><mtext>VLB<\/mtext><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"double-struck\">E<\/mi><mi>q<\/mi><\/msub><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">[<\/mo><mi>log<\/mi><mo>\u2061<\/mo><mfrac><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>0<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mo fence=\"true\">]<\/mo><\/mrow><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{\\text{VLB}} \\;=\\; -\\,\\mathbb{E}_{q}\\!\\left[\\log\\frac{p_\\theta(\\mathbf{x}_{0:T})}{q(\\mathbf{x}_{1:T}\\mid\\mathbf{x}_0)}\\right],<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">which we minimise to push the model joint toward the data joint induced by <math><semantics><mrow><mi>q<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">q<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<p class=\"\">What makes diffusion special compared to a generic VAE is that we <em>fix<\/em> the variational distribution <math><semantics><mrow><mi>q<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">q<\/annotation><\/semantics><\/math> to be the known forward process, not learnt, not parameterised, just the simple Gaussian chain we constructed above. 
This is the conceptual key: a diffusion model is a hierarchical VAE with <math><semantics><mrow><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">T<\/annotation><\/semantics><\/math> latents whose encoder is frozen by design and whose decoder is a learnable reverse Markov chain.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?ssl=1\"><img decoding=\"async\" width=\"2560\" height=\"447\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?fit=625%2C109&amp;ssl=1\" alt=\"\" class=\"wp-image-14287\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?w=2560&amp;ssl=1 2560w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?resize=300%2C52&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?resize=1024%2C179&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?resize=768%2C134&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?resize=1536%2C268&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?resize=2048%2C358&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?resize=624%2C109&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/DDPM_image-scaled.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p class=\"\">The diagram above summarises the construction we have just walked through: a frozen forward Markov chain <math><semantics><mrow><mi>q<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">q<\/annotation><\/semantics><\/math> that destroys structure, and a learnable reverse chain <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta<\/annotation><\/semantics><\/math> that reconstructs it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Decomposing <math><semantics><mrow><msub><mi mathvariant=\"script\">L<\/mi><mtext>VLB<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{\\text{VLB}}<\/annotation><\/semantics><\/math><\/h3>\n\n\n\n<p class=\"\">The integral defining <math><semantics><mrow><msub><mi mathvariant=\"script\">L<\/mi><mtext>VLB<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{\\text{VLB}}<\/annotation><\/semantics><\/math> is still high-dimensional, but it admits a clean per-step decomposition. 
Start by expanding the ratio inside the log, using the factorisations of <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>0<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_{0:T})<\/annotation><\/semantics><\/math> and <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_{1:T}\\mid\\mathbf{x}_0)<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>log<\/mi><mo>\u2061<\/mo><mfrac><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>0<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi>log<\/mi><mo>\u2061<\/mo><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>+<\/mo><munderover><mo>\u2211<\/mo><mrow><mi>t<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><mi>T<\/mi><\/munderover><mi>log<\/mi><mo>\u2061<\/mo><mfrac><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\log\\frac{p_\\theta(\\mathbf{x}_{0:T})}{q(\\mathbf{x}_{1:T}\\mid\\mathbf{x}_0)} \\;=\\; \\log p(\\mathbf{x}_T) + \\sum_{t=1}^{T}\\log\\frac{p_\\theta(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t)}{q(\\mathbf{x}_t\\mid\\mathbf{x}_{t-1})}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The forward conditionals <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_t\\mid\\mathbf{x}_{t-1})<\/annotation><\/semantics><\/math> are not directly comparable to the reverse conditionals <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation 
encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t)<\/annotation><\/semantics><\/math>: they go in opposite directions. To make them comparable, rewrite <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_t \\mid \\mathbf{x}_{t-1})<\/annotation><\/semantics><\/math> for <math><semantics><mrow><mi>t<\/mi><mo>\u2265<\/mo><mn>2<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t \\ge 2<\/annotation><\/semantics><\/math> using Bayes&#8217; rule, conditioning everything on <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_t\\mid\\mathbf{x}_{t-1}) \\;=\\; q(\\mathbf{x}_t\\mid\\mathbf{x}_{t-1},\\mathbf{x}_0) \\;=\\; \\frac{q(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t,\\mathbf{x}_0)\\,q(\\mathbf{x}_t\\mid\\mathbf{x}_0)}{q(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_0)}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The first equality uses the Markov property of the forward chain; the second is Bayes. 
Substituting back and using a telescoping argument on the marginals (full derivation in [<a href=\"#ref-ho2020\">Ho et al., 2020<\/a>, appendix A](#ref-ho2020)), one obtains<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"script\">L<\/mi><mtext>VLB<\/mtext><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><munder><munder><mrow><msub><mi>D<\/mi><mrow><mi mathvariant=\"normal\">K<\/mi><mi mathvariant=\"normal\">L<\/mi><\/mrow><\/msub><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u2225<\/mi><mtext>\u2009<\/mtext><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><\/mrow><mo stretchy=\"true\">\u23df<\/mo><\/munder><msub><mi>L<\/mi><mi>T<\/mi><\/msub><\/munder><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><munderover><mo>\u2211<\/mo><mrow><mi>t<\/mi><mo>=<\/mo><mn>2<\/mn><\/mrow><mi>T<\/mi><\/munderover><munder><munder><mrow><msub><mi mathvariant=\"double-struck\">E<\/mi><mi>q<\/mi><\/msub><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">[<\/mo><msub><mi>D<\/mi><mrow><mi mathvariant=\"normal\">K<\/mi><mi mathvariant=\"normal\">L<\/mi><\/mrow><\/msub><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u2225<\/mi><mtext>\u2009<\/mtext><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mo fence=\"true\">]<\/mo><\/mrow><\/mrow><mo stretchy=\"true\">\u23df<\/mo><\/munder><msub><mi>L<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/munder><mtext>\u2005\u200a<\/mtext><munder><munder><mrow><mo>\u2212<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"double-struck\">E<\/mi><mi>q<\/mi><\/msub><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">[<\/mo><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>1<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo fence=\"true\">]<\/mo><\/mrow><\/mrow><mo stretchy=\"true\">\u23df<\/mo><\/munder><msub><mi>L<\/mi><mn>0<\/mn><\/msub><\/munder><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{\\text{VLB}} \\;=\\; \\underbrace{D_{\\mathrm{KL}}\\!\\bigl(q(\\mathbf{x}_T\\mid\\mathbf{x}_0)\\,\\Vert\\,p(\\mathbf{x}_T)\\bigr)}_{L_T} \\;+\\; 
\\sum_{t=2}^{T}\\underbrace{\\mathbb{E}_{q}\\!\\left[D_{\\mathrm{KL}}\\!\\bigl(q(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t,\\mathbf{x}_0)\\,\\Vert\\,p_\\theta(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t)\\bigr)\\right]}_{L_{t-1}} \\;\\underbrace{-\\,\\mathbb{E}_{q}\\!\\left[\\log p_\\theta(\\mathbf{x}_0\\mid\\mathbf{x}_1)\\right]}_{L_0}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This decomposition has a beautiful structure. <math><semantics><mrow><msub><mi>L<\/mi><mi>T<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">L_T<\/annotation><\/semantics><\/math> is essentially constant: it measures how close the forward marginal <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_T\\mid\\mathbf{x}_0)<\/annotation><\/semantics><\/math> is to the prior <math><semantics><mrow><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p(\\mathbf{x}_T) = \\mathcal{N}(\\mathbf{0},\\mathbf{I})<\/annotation><\/semantics><\/math>, and is approximately zero by construction. <math><semantics><mrow><msub><mi>L<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">L_0<\/annotation><\/semantics><\/math> is a final-step log-likelihood that becomes negligible at high <math><semantics><mrow><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">T<\/annotation><\/semantics><\/math>. 
The interesting terms are the per-step KL divergences <math><semantics><mrow><msub><mi>L<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">L_{t-1}<\/annotation><\/semantics><\/math>, which compare the <em>true posterior<\/em> <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t,\\mathbf{x}_0)<\/annotation><\/semantics><\/math>, a Gaussian with closed form thanks to Bayes&#8217; rule on Gaussians, against the model&#8217;s reverse conditional <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t)<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<p class=\"\">The true posterior is<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"script\">N<\/mi><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo separator=\"true\">;<\/mo><mtext>\u2009<\/mtext><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mi mathvariant=\"bold\">I<\/mi><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t,\\mathbf{x}_0) \\;=\\; \\mathcal{N}\\bigl(\\mathbf{x}_{t-1};\\,\\tilde{\\boldsymbol{\\mu}}_t(\\mathbf{x}_t,\\mathbf{x}_0),\\,\\tilde{\\beta}_t\\mathbf{I}\\bigr),<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">with closed-form posterior mean and variance<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo 
stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mtext>\u2009<\/mtext><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\boldsymbol{\\mu}}_t(\\mathbf{x}_t,\\mathbf{x}_0) \\;=\\; \\frac{\\sqrt{\\bar{\\alpha}_{t-1}}\\,\\beta_t}{1-\\bar{\\alpha}_t}\\,\\mathbf{x}_0 \\;+\\; \\frac{\\sqrt{\\alpha_t}\\,(1-\\bar{\\alpha}_{t-1})}{1-\\bar{\\alpha}_t}\\,\\mathbf{x}_t,\\qquad \\tilde{\\beta}_t \\;=\\; \\frac{1-\\bar{\\alpha}_{t-1}}{1-\\bar{\\alpha}_t}\\,\\beta_t.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Here we can notice how the new mean of the gaussian is an interpolation between the noisy state in which we are at point <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> and the clean\/unnoised sample at time zero!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">From KL to MSE on means<\/h3>\n\n\n\n<p class=\"\">A key fact, originally due to <a href=\"#ref-feller1949\">Feller (1949)<\/a>, is that when the forward step <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_t \\mid \\mathbf{x}_{t-1})<\/annotation><\/semantics><\/math> is Gaussian with sufficiently small variance <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t<\/annotation><\/semantics><\/math>, the <em>exact<\/em> reverse step <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi 
mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_{t-1} \\mid \\mathbf{x}_t)<\/annotation><\/semantics><\/math> is also approximately Gaussian. This is the formal justification for parameterising the model&#8217;s reverse conditional as a Gaussian with fixed covariance: <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo separator=\"true\">;<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_\\theta(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t) = \\mathcal{N}(\\mathbf{x}_{t-1};\\,\\boldsymbol{\\mu}_\\theta(\\mathbf{x}_t,t),\\,\\sigma_t^2\\mathbf{I})<\/annotation><\/semantics><\/math> with <math><semantics><mrow><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mo>\u2208<\/mo><mo stretchy=\"false\">{<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">}<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_t^2 \\in \\{\\beta_t, \\tilde{\\beta}_t\\}<\/annotation><\/semantics><\/math>. The variance is <em>fixed<\/em> (not learned) to one of the two endpoints corresponding to the upper and lower bounds on the true reverse variance derived in <a href=\"#ref-ho2020\">Ho et al. (2020)<\/a>. 
The KL divergence between two Gaussians with the same covariance <math><semantics><mrow><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mi mathvariant=\"bold\">I<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_t^2\\mathbf{I}<\/annotation><\/semantics><\/math> has a particularly simple form: it reduces to a scaled squared distance between the means:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>D<\/mi><mrow><mi mathvariant=\"normal\">K<\/mi><mi mathvariant=\"normal\">L<\/mi><\/mrow><\/msub><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u2225<\/mi><mtext>\u2009<\/mtext><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mi>\u03b8<\/mi><\/msub><mo separator=\"true\">,<\/mo><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mn>1<\/mn><mrow><mn>2<\/mn><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><\/mrow><\/mfrac><mtext>\u2009<\/mtext><mo stretchy=\"false\">\u2225<\/mo><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msub><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mi>\u03b8<\/mi><\/msub><msup><mo stretchy=\"false\">\u2225<\/mo><mn>2<\/mn><\/msup><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">D_{\\mathrm{KL}}\\!\\bigl(\\mathcal{N}(\\tilde{\\boldsymbol{\\mu}}_t,\\sigma_t^2\\mathbf{I})\\,\\Vert\\,\\mathcal{N}(\\boldsymbol{\\mu}_\\theta,\\sigma_t^2\\mathbf{I})\\bigr) \\;=\\; \\frac{1}{2\\sigma_t^2}\\,\\lVert\\tilde{\\boldsymbol{\\mu}}_t &#8211; \\boldsymbol{\\mu}_\\theta\\rVert^2.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">So the entire per-step loss reduces to an <strong>MSE between the model&#8217;s predicted mean and the true posterior mean<\/strong>. 
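<\/p>\n\n\n\n<p class=\"\">This identity is easy to verify numerically. A quick sanity check (not part of any training code; the two means are random stand-ins):<\/p>\n\n\n\n<pre><code class=\"language-python\">import torch\nfrom torch.distributions import Normal, kl_divergence\n\nd, sigma_t = 8, 0.1\nmu_tilde = torch.randn(d)          # stand-in for the true posterior mean\nmu_theta = torch.randn(d)          # stand-in for the model mean\nscale = sigma_t * torch.ones(d)\n\n# For diagonal Gaussians the KL factorises over dimensions, so sum per-dim KLs.\nkl = kl_divergence(Normal(mu_tilde, scale), Normal(mu_theta, scale)).sum()\nclosed_form = ((mu_tilde - mu_theta) ** 2).sum() \/ (2 * sigma_t ** 2)\nassert torch.allclose(kl, closed_form, atol=1e-5)\n<\/code><\/pre>\n\n\n\n<p class=\"\">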
The strategic question is then: <em>how should we parameterise the network&#8217;s output to make this MSE easy to learn?<\/em><\/p>\n\n\n\n<p class=\"\">Using the reparameterisation <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0 = (\\mathbf{x}_t &#8211; \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon})\/\\sqrt{\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math>, that is, expressing <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> in terms of the noise <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> that was used to construct <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math>, and substituting into <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\boldsymbol{\\mu}}_t<\/annotation><\/semantics><\/math>, mechanical algebra gives<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mn>1<\/mn><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><\/mfrac><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><mfrac><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mfrac><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo fence=\"true\">)<\/mo><\/mrow><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\boldsymbol{\\mu}}_t(\\mathbf{x}_t,\\boldsymbol{\\epsilon}) \\;=\\; \\frac{1}{\\sqrt{\\alpha_t}}\\!\\left(\\mathbf{x}_t &#8211; \\frac{\\beta_t}{\\sqrt{1-\\bar{\\alpha}_t}}\\,\\boldsymbol{\\epsilon}\\right).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This suggests parameterising the model&#8217;s mean by the <em>same<\/em> expression but with a learnt <math><semantics><mrow><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo 
stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}_\\theta(\\mathbf{x}_t,t)<\/annotation><\/semantics><\/math> in place of the true <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mn>1<\/mn><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><\/mfrac><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><mfrac><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo fence=\"true\">)<\/mo><\/mrow><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\mu}_\\theta(\\mathbf{x}_t,t) \\;=\\; \\frac{1}{\\sqrt{\\alpha_t}}\\!\\left(\\mathbf{x}_t &#8211; \\frac{\\beta_t}{\\sqrt{1-\\bar{\\alpha}_t}}\\,\\boldsymbol{\\epsilon}_\\theta(\\mathbf{x}_t,t)\\right).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Substituting <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\boldsymbol{\\mu}}_t<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\mu}_\\theta<\/annotation><\/semantics><\/math> into the KL expression and simplifying gives, after cancellation,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>L<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><msubsup><mi>\u03b2<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mrow><mn>2<\/mn><mtext>\u2009<\/mtext><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mtext>\u2009<\/mtext><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><mtext>\u2009<\/mtext><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"double-struck\">E<\/mi><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><\/msub><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">[<\/mo><mo stretchy=\"false\">\u2225<\/mo><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo>\u2212<\/mo><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi 
mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><msup><mo stretchy=\"false\">\u2225<\/mo><mn>2<\/mn><\/msup><mo fence=\"true\">]<\/mo><\/mrow><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">L_{t-1} \\;=\\; \\frac{\\beta_t^2}{2\\,\\sigma_t^2\\,\\alpha_t\\,(1-\\bar{\\alpha}_t)}\\,\\mathbb{E}_{\\mathbf{x}_0,\\boldsymbol{\\epsilon}}\\!\\left[\\lVert\\boldsymbol{\\epsilon} &#8211; \\boldsymbol{\\epsilon}_\\theta(\\mathbf{x}_t,t)\\rVert^2\\right].<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Each per-step VLB term has thus reduced to a <em>weighted MSE<\/em> between the true noise (drawn when we constructed <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math>) and the network&#8217;s predicted noise. This is, structurally, vanilla supervised regression: predict <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> given <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<p class=\"\"><a href=\"#ref-ho2020\">Ho et al.<\/a> then made the empirical observation that <em>dropping the time-dependent weight<\/em> in front of this MSE produces strictly better samples than keeping the principled VLB weight. The resulting <strong>simple objective<\/strong> is the workhorse of modern diffusion:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"script\">L<\/mi><mtext>simple<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03b8<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi mathvariant=\"double-struck\">E<\/mi><mrow><mi>t<\/mi><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><\/msub><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">[<\/mo><mtext>\u2009<\/mtext><mo stretchy=\"false\">\u2225<\/mo><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo>\u2212<\/mo><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><msup><mo stretchy=\"false\">\u2225<\/mo><mn>2<\/mn><\/msup><mtext>\u2009<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">]<\/mo><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{\\text{simple}}(\\theta) \\;=\\; 
\\mathbb{E}_{t,\\mathbf{x}_0,\\boldsymbol{\\epsilon}}\\bigl[\\,\\lVert \\boldsymbol{\\epsilon} &#8211; \\boldsymbol{\\epsilon}_\\theta(\\mathbf{x}_t,t)\\rVert^2\\,\\bigr],\\qquad \\mathbf{x}_t = \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0 + \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The dropped weighting is largest for small <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> (mild noise) and effectively <em>upweights<\/em> the harder high-noise levels, which empirically gives sharper samples (<a href=\"#ref-ho2020\">Ho et al., 2020<\/a>).<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>This objective is extraordinarily efficient and scalable<\/em>, and this is, in my view, the single most important reason diffusion models took off so quickly. Look at what training requires at each step: pick a data point <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> from the training set, sample a timestep <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> uniformly, sample a single Gaussian noise vector <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>, form <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0 + \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> using <em>precomputed<\/em> <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}_t<\/annotation><\/semantics><\/math> tables, and regress with one network forward-backward pass. <em>There is no need to simulate the trajectory.<\/em> There is no MCMC inner loop, no ODE solver in training, no need to even store any state between iterations. The training cost per step is identical to that of a vanilla supervised regressor. This is the property that made diffusion scale from CIFAR-10 to ImageNet to text-conditional generation to protein design without much architectural rethinking.<\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Ancestral sampling<\/h3>\n\n\n\n<p class=\"\">At inference time we want new samples from the trained model, so we must run the Markov chain backward. 
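<\/p>\n\n\n\n<p class=\"\">(Before doing that, it is worth seeing just how little code the training side in the box above actually needs. A minimal sketch of one optimisation step of the <em>simple<\/em> objective; here <code>model<\/code>, <code>opt<\/code>, <code>x0<\/code> and <code>alphabar<\/code> are assumed to exist, and <code>model<\/code> predicts the noise.)<\/p>\n\n\n\n<pre><code class=\"language-python\">import torch\n\ndef training_step(model, opt, x0, alphabar, T=1000):\n    # One step of the simple objective: sample t and eps, form x_t, regress.\n    b = x0.shape[0]\n    t = torch.randint(1, T + 1, (b,))                      # uniform timestep\n    eps = torch.randn_like(x0)                             # fresh Gaussian noise\n    ab = alphabar[t - 1].view(b, *([1] * (x0.dim() - 1)))  # broadcast alphabar_t\n    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps           # forward marginal\n    loss = ((eps - model(x_t, t)) ** 2).mean()             # unweighted MSE\n    opt.zero_grad()\n    loss.backward()\n    opt.step()\n    return loss.item()\n<\/code><\/pre>\n\n\n\n<p class=\"\">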
Start from <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_T\\sim\\mathcal{N}(\\mathbf{0},\\mathbf{I})<\/annotation><\/semantics><\/math> and for <math><semantics><mrow><mi>t<\/mi><mo>=<\/mo><mi>T<\/mi><mo separator=\"true\">,<\/mo><mi>T<\/mi><mo>\u2212<\/mo><mn>1<\/mn><mo separator=\"true\">,<\/mo><mo>\u2026<\/mo><mo separator=\"true\">,<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t = T, T-1, \\dots, 1<\/annotation><\/semantics><\/math> apply the <strong>ancestral sampler<\/strong>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mn>1<\/mn><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><\/mfrac><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><mfrac><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo fence=\"true\">)<\/mo><\/mrow><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">z<\/mi><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi mathvariant=\"bold\">z<\/mi><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><mtext>\u2005\u200a<\/mtext><mtext>\u2005\u200a<\/mtext><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mo>\u2208<\/mo><mo stretchy=\"false\">{<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">}<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1} \\;=\\; \\frac{1}{\\sqrt{\\alpha_t}}\\!\\left(\\mathbf{x}_t &#8211; \\frac{\\beta_t}{\\sqrt{1-\\bar{\\alpha}_t}}\\,\\boldsymbol{\\epsilon}_\\theta(\\mathbf{x}_t,t)\\right) \\;+\\; \\sigma_t\\,\\mathbf{z},\\qquad \\mathbf{z}\\sim\\mathcal{N}(\\mathbf{0},\\mathbf{I}),\\;\\;\\sigma_t^2\\in\\{\\beta_t,\\tilde{\\beta}_t\\}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Here, too, we can see how the new, less noisy sample at time <math><semantics><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t-1<\/annotation><\/semantics><\/math> is indeed our sample at time <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> minus the noise that the network recognises in the sample. 
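<\/p>\n\n\n\n<p class=\"\">Spelled out as code, the whole sampler is a short loop. A minimal sketch, reusing the schedule tensors precomputed earlier and assuming a noise-predicting <code>model<\/code> (both assumptions); by common convention no fresh noise is added at the final step:<\/p>\n\n\n\n<pre><code class=\"language-python\">import torch\n\n@torch.no_grad()\ndef ancestral_sample(model, shape):\n    x = torch.randn(shape)                       # x_T ~ N(0, I)\n    for t in range(T, 0, -1):\n        i = t - 1                                # 0-based index into the schedules\n        eps = model(x, torch.full((shape[0],), t))\n        mean = (x - beta[i] \/ (1 - alphabar[i]).sqrt() * eps) \/ alpha[i].sqrt()\n        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)\n        x = mean + beta_tilde[i].sqrt() * z      # sigma_t^2 = beta_tilde here\n    return x\n<\/code><\/pre>\n\n\n\n<p class=\"\">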
Heuristically, we are doing a step towards the denoised sample (away from the noise).<\/p>\n\n\n\n<p class=\"\">A later refinement worth flagging here: <strong><a href=\"#ref-nichol2021\">Nichol &amp; Dhariwal (2021)<\/a><\/strong> showed that a <strong>cosine schedule<\/strong>,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><mi>f<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi>f<\/mi><mo stretchy=\"false\">(<\/mo><mn>0<\/mn><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi>f<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msup><mrow><mi>cos<\/mi><mo>\u2061<\/mo><\/mrow><mn>2<\/mn><\/msup><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">(<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mrow><mi>t<\/mi><mi mathvariant=\"normal\">\/<\/mi><mi>T<\/mi><mo>+<\/mo><mi>s<\/mi><\/mrow><mrow><mn>1<\/mn><mo>+<\/mo><mi>s<\/mi><\/mrow><\/mfrac><\/mstyle><mo>\u22c5<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mi>\u03c0<\/mi><mn>2<\/mn><\/mfrac><\/mstyle><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">)<\/mo><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi>s<\/mi><mo>=<\/mo><mn>0.008<\/mn><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}_t \\;=\\; \\frac{f(t)}{f(0)},\\qquad f(t) \\;=\\; \\cos^2\\!\\Bigl(\\tfrac{t\/T+s}{1+s}\\cdot\\tfrac{\\pi}{2}\\Bigr),\\qquad s = 0.008,<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">avoids the pathology of the linear schedule destroying low-resolution signal too quickly; <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}_t<\/annotation><\/semantics><\/math> stays near 1 for longer in the early steps and then collapses smoothly. They also showed that <em>learning<\/em> a log-interpolation between <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\beta}_t<\/annotation><\/semantics><\/math> for the reverse covariance improves log-likelihood. Both choices are now standard.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">DDPM in the wild: the original RFDiffusion sampler<\/h3>\n\n\n\n<p class=\"\">The original RFDiffusion (<a href=\"#ref-watson2023\">Watson et al., 2023<\/a>), one of the models that launched de-novo protein design as we know it today, is in this DDPM lineage. Its inner sampling loop in <a href=\"https:\/\/github.com\/RosettaCommons\/RFdiffusion\/blob\/main\/rfdiffusion\/inference\/utils.py\"><code>rfdiffusion\/inference\/utils.py<\/code><\/a> implements <em>exactly<\/em> the ancestral update we have just derived. 
The key helper, <code>get_mu_xt_x0<\/code>, computes the posterior mean <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\boldsymbol{\\mu}}_t(\\mathbf{x}_t, \\mathbf{x}_0)<\/annotation><\/semantics><\/math> and posterior variance <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\beta}_t<\/annotation><\/semantics><\/math> from cached <math><semantics><mrow><mi>\u03b2<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\beta<\/annotation><\/semantics><\/math> and <math><semantics><mrow><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}<\/annotation><\/semantics><\/math> schedules:<\/p>\n\n\n\n<pre><code class=\"language-python\">def get_mu_xt_x0(xt, px0, t, beta_schedule, alphabar_schedule, eps=1e-6):\n    \"\"\"Given xt, predicted x0 and the timestep t, give mu of x(t-1).\"\"\"\n    t_idx = t - 1\n    # (1) Posterior variance: sigma = ((1 - alphabar_{t-1}) \/ (1 - alphabar_t)) * beta_t\n    #     This is exactly tilde{beta}_t.\n    sigma = (\n        (1 - alphabar_schedule[t_idx - 1]) \/ (1 - alphabar_schedule[t_idx])\n    ) * beta_schedule[t_idx]\n    xt_ca  = xt[:, 1, :]    # C-alpha coordinates at time t\n    px0_ca = px0[:, 1, :]   # network's prediction of clean C-alpha coordinates\n    # (2) First term of the posterior mean: coefficient on x_0.\n    a = (\n        (torch.sqrt(alphabar_schedule[t_idx - 1] + eps) * beta_schedule[t_idx])\n        \/ (1 - alphabar_schedule[t_idx])\n    ) * px0_ca\n    # (3) Second term of the posterior mean: coefficient on x_t,\n    #     with sqrt(1 - beta_t) = sqrt(alpha_t).\n    b = (\n        (torch.sqrt(1 - beta_schedule[t_idx] + eps) * (1 - alphabar_schedule[t_idx - 1]))\n        \/ (1 - alphabar_schedule[t_idx])\n    ) * xt_ca\n    mu = a + b\n    return mu, sigma\n<\/code><\/pre>\n\n\n\n<p class=\"\">And the helper that drives one ancestral step, <code>get_next_ca<\/code>, samples <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1}<\/annotation><\/semantics><\/math> from the Gaussian <math><semantics><mrow><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{N}(\\tilde{\\boldsymbol{\\mu}}_t, \\tilde{\\beta}_t\\mathbf{I})<\/annotation><\/semantics><\/math> in one line:<\/p>\n\n\n\n<pre><code class=\"language-python\">def get_next_ca(xt, px0, t, ..., noise_scale=1.0):\n    # Compute posterior mean and variance from the cached schedules.\n    mu, sigma = get_mu_xt_x0(xt, px0, t, beta_schedule, alphabar_schedule)\n    # (4) Ancestral sampling: x_{t-1} ~ N(mu, sigma * I).\n    sampled_crds = torch.normal(mu, 
torch.sqrt(sigma * noise_scale))\n    delta = sampled_crds - xt[:, 1, :]\n    ...\n    out_crds = xt + delta[:, None, :]\n    return out_crds \/ crd_scale, delta \/ crd_scale\n<\/code><\/pre>\n\n\n\n<p class=\"\">Each numbered line is a <em>line-for-line transcription<\/em> of an equation from Section 1.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\"><strong>Line (1)<\/strong> is the closed-form DDPM posterior variance <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo>=<\/mo><mfrac><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mtext>\u2009<\/mtext><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\beta}_t = \\tfrac{1-\\bar{\\alpha}_{t-1}}{1-\\bar{\\alpha}_t}\\,\\beta_t<\/annotation><\/semantics><\/math>, with <code>alphabar_schedule<\/code> and <code>beta_schedule<\/code> precomputed at startup.<\/li>\n\n\n\n<li class=\"\"><strong>Lines (2) and (3)<\/strong> are the two terms of the closed-form posterior mean<br><math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\tilde{\\boldsymbol{\\mu}}_t(\\mathbf{x}_t, \\mathbf{x}_0) \\;=\\; \\frac{\\sqrt{\\bar{\\alpha}_{t-1}}\\,\\beta_t}{1-\\bar{\\alpha}_t}\\,\\mathbf{x}_0 \\;+\\; \\frac{\\sqrt{\\alpha_t}\\,(1-\\bar{\\alpha}_{t-1})}{1-\\bar{\\alpha}_t}\\,\\mathbf{x}_t,<\/annotation><\/semantics><\/math><br>with <math><semantics><mrow><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\alpha_t}<\/annotation><\/semantics><\/math> written out as <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mrow><annotation 
encoding=\"application\/x-tex\">\\sqrt{1-\\beta_t}<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> replaced by the network&#8217;s prediction <code>px0<\/code> (an <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math>-parameterisation rather than <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>-parameterisation, but as we will see in Section 2 the two are linear transformations of each other in this Gaussian setting).<\/li>\n\n\n\n<li class=\"\"><strong>Line (4)<\/strong> is the ancestral sampling step itself: draw <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><msub><mover accent=\"true\"><mi mathvariant=\"bold-italic\">\u03bc<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1} \\sim \\mathcal{N}(\\tilde{\\boldsymbol{\\mu}}_t,\\,\\tilde{\\beta}_t\\mathbf{I})<\/annotation><\/semantics><\/math>, exactly the per-step update from Section 1. The <code>noise_scale<\/code> knob is a protein-design-specific multiplier that lets the user trade off sample diversity against fidelity to the training distribution; setting <code>noise_scale = 1.0<\/code> recovers the textbook DDPM sampler.<\/li>\n<\/ul>\n\n\n\n<p class=\"\">One little note to keep in mind here: RFDiffusion uses this Gaussian DDPM update only on the <em>translational<\/em> (C-alpha) part of the structure; rotations are diffused on <math><semantics><mrow><mi>S<\/mi><mi>O<\/mi><mo stretchy=\"false\">(<\/mo><mn>3<\/mn><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">SO(3)<\/annotation><\/semantics><\/math> with an IGSO(3) reverse process (the <code>get_next_frames<\/code> helper alongside <code>get_next_ca<\/code> in the same file), because the Gaussian formalism does not transfer cleanly to a curved manifold.<\/p>\n\n\n\n<div class=\"h-16\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">2. DDIM: Non-Markovian Generalisation and Deterministic Sampling<\/h2>\n\n\n\n<p class=\"\">As mentioned above, the DDPM formulation allows for highly efficient training that can be parallelised and therefore scaled. At sampling time, however, the Markov structure of the process forces us to start from the tractable distribution at <math><semantics><mrow><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">T<\/annotation><\/semantics><\/math> and traverse every step of the learned chain in reverse. This meant that even with a perfectly trained model one needed hundreds to thousands of network evaluations to produce a single sample. 
<strong><a href=\"#ref-song2021a\">Song, Meng &amp; Ermon (2021a)<\/a><\/strong> therefore asked the following question: <em>does the DDPM training objective actually <em>require<\/em> the forward process to be Markov?<\/em><\/p>\n\n\n\n<p class=\"\">The answer is no, and the consequence is a sampler that converges in 20\u201350 steps instead of 1000.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A first observation: noise-prediction is denoised-image-prediction<\/h3>\n\n\n\n<p class=\"\">Before deriving DDIM, it is worth absorbing the equivalence between two natural parameterisations of the network&#8217;s output. The forward marginal is<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t \\;=\\; \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0 \\;+\\; \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This is a <em>linear<\/em> relationship between three quantities (<math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math>, <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math>, <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>). 
Given any two of them we can solve for the third:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><\/mfrac><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0 \\;=\\; \\frac{\\mathbf{x}_t &#8211; \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon}}{\\sqrt{\\bar{\\alpha}_t}},\\qquad \\boldsymbol{\\epsilon} \\;=\\; \\frac{\\mathbf{x}_t &#8211; \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0}{\\sqrt{1-\\bar{\\alpha}_t}}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">So a network that takes <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> as input and predicts <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> is <em>equivalent<\/em> to a network that predicts <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math>: each prediction is a fixed linear transformation of the other (with coefficients that depend on <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> but not on the data). 
In particular, if the network outputs <math><semantics><mrow><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}_\\theta(\\mathbf{x}_t,t)<\/annotation><\/semantics><\/math> we can immediately extract a denoised-image estimate<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0(\\mathbf{x}_t,t) \\;=\\; \\frac{\\mathbf{x}_t &#8211; \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon}_\\theta(\\mathbf{x}_t,t)}{\\sqrt{\\bar{\\alpha}_t}}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0<\/annotation><\/semantics><\/math> is essentially the <strong>Tweedie-formula one-step denoiser<\/strong>: the network&#8217;s best guess of the clean image given the noisy observation. 
We will use it heavily in what follows.<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>Two equivalent ways to think about a denoising step.<\/em> Both parameterisations describe exactly the same operation, but they invite different mental pictures.<\/p>\n<ul>\n<li><strong>Predict the clean image, then take a small step toward it.<\/strong> Feed <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> to the network and ask &#8220;what does the clean image <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> look like?&#8221; Recall from the discussion of the true posterior mean above that one step of denoising is essentially an <em>interpolation between the noisy state <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> and the predicted clean state <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0<\/annotation><\/semantics><\/math><\/em>, with the interpolation weight controlled by the schedule. So a step of the reverse chain literally drags the sample a little closer to where the network thinks the data lives.<\/li>\n<li><strong>Predict the noise, then subtract a small fraction of it.<\/strong> Equivalently, ask &#8220;what noise was added to produce <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math>?&#8221; and remove a small portion of it, taking a step <em>away<\/em> from the noise direction. 
The two views are linked by the linear relation <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0 = (\\mathbf{x}_t &#8211; \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon})\/\\sqrt{\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math>, so the two updates are mathematically identical.<\/li>\n<\/ul>\n<p>The DDPM paper picked <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>-prediction because the regression target has unit variance by construction (the network is asked to predict an <math><semantics><mrow><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{N}(\\mathbf{0}, \\mathbf{I})<\/annotation><\/semantics><\/math> vector at every <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>), which makes the network well-conditioned across the entire schedule. Different downstream methods will switch between parameterisations as convenient.<\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">A family of non-Markovian forward processes<\/h3>\n\n\n\n<p class=\"\">The DDPM loss <math><semantics><mrow><msub><mi mathvariant=\"script\">L<\/mi><mtext>simple<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{\\text{simple}}<\/annotation><\/semantics><\/math> depends only on the <em>marginals<\/em> <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_t\\mid\\mathbf{x}_0)<\/annotation><\/semantics><\/math>: at each step we sample one <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> from this marginal and regress. It does <em>not<\/em> depend on the joint <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mn>1<\/mn><mo>:<\/mo><mi>T<\/mi><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_{1:T}\\mid\\mathbf{x}_0)<\/annotation><\/semantics><\/math> in any way that involves the inter-step structure. <a href=\"#ref-song2021a\">Song et al. 
(2021a)<\/a> leverage this: they consider an entire <em>family<\/em> of (possibly non-Markovian) forward processes that share the same per-step marginals as DDPM but differ in their <em>posterior<\/em> conditionals <math><semantics><mrow><msub><mi>q<\/mi><mi>\u03c3<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q_\\sigma(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t,\\mathbf{x}_0)<\/annotation><\/semantics><\/math>. Each member of this family corresponds to a valid sampler that re-uses the <strong>same trained noise predictor<\/strong> <math><semantics><mrow><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}_\\theta<\/annotation><\/semantics><\/math>, no retraining needed.<\/p>\n\n\n\n<p class=\"\">The family is indexed by a sequence of stochasticity parameters <math><semantics><mrow><mo stretchy=\"false\">{<\/mo><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">}<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\{\\sigma_t\\}<\/annotation><\/semantics><\/math> and posits<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>q<\/mi><mi>\u03c3<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"script\">N<\/mi><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo separator=\"true\">;<\/mo><mtext>\u2009<\/mtext><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><mo>\u22c5<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mfrac><\/mstyle><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mi mathvariant=\"bold\">I<\/mi><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation 
encoding=\"application\/x-tex\">q_\\sigma(\\mathbf{x}_{t-1}\\mid\\mathbf{x}_t,\\mathbf{x}_0) \\;=\\; \\mathcal{N}\\!\\Bigl(\\mathbf{x}_{t-1};\\,\\sqrt{\\bar{\\alpha}_{t-1}}\\,\\mathbf{x}_0 + \\sqrt{1-\\bar{\\alpha}_{t-1}-\\sigma_t^2}\\cdot\\tfrac{\\mathbf{x}_t &#8211; \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0}{\\sqrt{1-\\bar{\\alpha}_t}},\\,\\sigma_t^2\\mathbf{I}\\Bigr).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The structure of this conditional is worth absorbing. The mean is a sum of two terms:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\"><math><semantics><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\bar{\\alpha}_{t-1}}\\,\\mathbf{x}_0<\/annotation><\/semantics><\/math>: the <em>signal component<\/em> at noise level <math><semantics><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t-1<\/annotation><\/semantics><\/math>, i.e. where <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> would project under the forward marginal at time <math><semantics><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t-1<\/annotation><\/semantics><\/math>;<\/li>\n\n\n\n<li class=\"\"><math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><mo>\u22c5<\/mo><mfrac><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mfrac><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\bar{\\alpha}_{t-1}-\\sigma_t^2}\\cdot\\tfrac{\\mathbf{x}_t &#8211; \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0}{\\sqrt{1-\\bar{\\alpha}_t}}<\/annotation><\/semantics><\/math>: a scaled version of the noise direction <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> implied by the pair <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\mathbf{x}_t, \\mathbf{x}_0)<\/annotation><\/semantics><\/math>.<\/li>\n<\/ul>\n\n\n\n<p class=\"\">Plus a Gaussian fluctuation of variance <math><semantics><mrow><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mi mathvariant=\"bold\">I<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_t^2 \\mathbf{I}<\/annotation><\/semantics><\/math>. <a href=\"#ref-song2021a\">Song et al. 
(2021a)<\/a> prove that, for any choice of <math><semantics><mrow><mo stretchy=\"false\">{<\/mo><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">}<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\{\\sigma_t\\}<\/annotation><\/semantics><\/math>, this family yields the <em>same<\/em> marginals <math><semantics><mrow><mi>q<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">q(\\mathbf{x}_t \\mid \\mathbf{x}_0)<\/annotation><\/semantics><\/math> as DDPM, and therefore the <em>same<\/em> training objective and the <em>same<\/em> trained network are valid.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The DDIM sampler<\/h3>\n\n\n\n<p class=\"\">Plugging in the network&#8217;s prediction <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0 = (\\mathbf{x}_t &#8211; \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon}_\\theta)\/\\sqrt{\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math> in place of the true <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math>, we obtain the <strong>DDIM update<\/strong> in a very transparent form:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><munder><munder><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><\/mrow><mo stretchy=\"true\">\u23df<\/mo><\/munder><mrow><mtext>signal\u00a0at\u00a0level\u00a0<\/mtext><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mstyle><\/mrow><\/munder><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><munder><munder><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><mo stretchy=\"true\">\u23df<\/mo><\/munder><mtext>noise\u00a0direction,\u00a0scaled<\/mtext><\/munder><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><munder><munder><mrow><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mtext>\u2009<\/mtext><mi 
mathvariant=\"bold\">z<\/mi><\/mrow><mo stretchy=\"true\">\u23df<\/mo><\/munder><mtext>fresh\u00a0randomness<\/mtext><\/munder><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi mathvariant=\"bold\">z<\/mi><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1} \\;=\\; \\underbrace{\\sqrt{\\bar{\\alpha}_{t-1}}\\,\\hat{\\mathbf{x}}_0}_{\\text{signal at level $t-1$}} \\;+\\; \\underbrace{\\sqrt{1-\\bar{\\alpha}_{t-1}-\\sigma_t^2}\\,\\boldsymbol{\\epsilon}_\\theta}_{\\text{noise direction, scaled}} \\;+\\; \\underbrace{\\sigma_t\\,\\mathbf{z}}_{\\text{fresh randomness}},\\qquad \\mathbf{z}\\sim\\mathcal{N}(\\mathbf{0},\\mathbf{I}).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Compare directly with the DDPM ancestral update,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mn>1<\/mn><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><\/mfrac><mtext>\u2009\u2063<\/mtext><mrow><mo fence=\"true\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><mfrac><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo fence=\"true\">)<\/mo><\/mrow><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mtext>DDPM<\/mtext><\/msubsup><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">z<\/mi><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1} \\;=\\; \\frac{1}{\\sqrt{\\alpha_t}}\\!\\left(\\mathbf{x}_t &#8211; \\frac{\\beta_t}{\\sqrt{1-\\bar{\\alpha}_t}}\\,\\boldsymbol{\\epsilon}_\\theta\\right) \\;+\\; \\sigma_t^{\\text{DDPM}}\\,\\mathbf{z},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">and the structural difference becomes clear. 
The DDPM update is a <em>local<\/em> perturbation of <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> whose coefficients <math><semantics><mrow><mn>1<\/mn><mi mathvariant=\"normal\">\/<\/mi><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">1\/\\sqrt{\\alpha_t}<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mi mathvariant=\"normal\">\/<\/mi><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t\/\\sqrt{1-\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math> are calibrated for a single <math><semantics><mrow><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\Delta t = 1<\/annotation><\/semantics><\/math> Markov step \u2014 the step size is baked into the formula. The DDIM update instead <em>reconstructs<\/em> <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1}<\/annotation><\/semantics><\/math> from scratch by combining the predicted clean image <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0<\/annotation><\/semantics><\/math> (rescaled to the target noise level via <math><semantics><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\bar{\\alpha}_{t-1}}<\/annotation><\/semantics><\/math>) with the predicted noise direction (rescaled by <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\bar{\\alpha}_{t-1}-\\sigma_t^2}<\/annotation><\/semantics><\/math>). The <em>target<\/em> time index <math><semantics><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t-1<\/annotation><\/semantics><\/math> enters the right-hand side only through the cumulative noise level <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}_{t-1}<\/annotation><\/semantics><\/math>; the local step quantities <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\alpha_t<\/annotation><\/semantics><\/math> do not appear explicitly at all. 
(The source time <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> still enters, of course, through <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}_\\theta<\/annotation><\/semantics><\/math>, which depend on <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}_t<\/annotation><\/semantics><\/math>.)<\/p>\n\n\n\n<p class=\"\">The stochasticity parameter <math><semantics><mrow><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_t<\/annotation><\/semantics><\/math> is conventionally written<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03b7<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi>\u03b7<\/mi><mtext>\u2009<\/mtext><msqrt><mrow><mfrac><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mtext>\u2009<\/mtext><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mfrac><\/mstyle><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">)<\/mo><\/mrow><\/msqrt><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_t(\\eta) \\;=\\; \\eta\\,\\sqrt{\\frac{1-\\bar{\\alpha}_{t-1}}{1-\\bar{\\alpha}_t}\\,\\Bigl(1-\\tfrac{\\bar{\\alpha}_t}{\\bar{\\alpha}_{t-1}}\\Bigr)},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">with <math><semantics><mrow><mi>\u03b7<\/mi><mo>\u2208<\/mo><mo stretchy=\"false\">[<\/mo><mn>0<\/mn><mo separator=\"true\">,<\/mo><mn>1<\/mn><mo stretchy=\"false\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\eta\\in[0,1]<\/annotation><\/semantics><\/math> interpolating between two limits:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\"><math><semantics><mrow><mi>\u03b7<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\eta = 1<\/annotation><\/semantics><\/math> recovers the DDPM ancestral sampler (up to a covariance choice equivalent to <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b2<\/mi><mo>~<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation 
encoding=\"application\/x-tex\">\\tilde{\\beta}_t<\/annotation><\/semantics><\/math>);<\/li>\n\n\n\n<li class=\"\"><math><semantics><mrow><mi>\u03b7<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\eta = 0<\/annotation><\/semantics><\/math> gives the <strong>deterministic DDIM sampler<\/strong>, the workhorse of the modern fast-sampling regime:<\/li>\n<\/ul>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1} \\;=\\; \\sqrt{\\bar{\\alpha}_{t-1}}\\,\\hat{\\mathbf{x}}_0(\\mathbf{x}_t,t) \\;+\\; \\sqrt{1-\\bar{\\alpha}_{t-1}}\\,\\boldsymbol{\\epsilon}_\\theta(\\mathbf{x}_t,t).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Why DDIM allows fewer sampling steps<\/h3>\n\n\n\n<p class=\"\">It is tempting to attribute DDIM&#8217;s step-skipping ability simply to &#8220;the trajectory is smooth.&#8221; That is part of the story but not the whole picture. 
Three observations, each with a clean intuitive content, together explain the effect.<\/p>\n\n\n\n<p class=\"\"><strong>(i) The update is anchored to a noise-level-agnostic prediction.<\/strong> Look again at the DDIM update: the right-hand side combines <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0<\/annotation><\/semantics><\/math> (the network&#8217;s guess of the clean image) and <math><semantics><mrow><msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}_\\theta<\/annotation><\/semantics><\/math> (the noise it sees) with coefficients <math><semantics><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>s<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\bar{\\alpha}_{s}}<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>s<\/mi><\/msub><mo>\u2212<\/mo><msubsup><mi>\u03c3<\/mi><mi>s<\/mi><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\bar{\\alpha}_{s}-\\sigma_s^2}<\/annotation><\/semantics><\/math> for <em>any<\/em> target noise level <math><semantics><mrow><mi>s<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">s<\/annotation><\/semantics><\/math> \u2014 and in the deterministic (<math><semantics><mrow><mi>\u03b7<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\eta = 0<\/annotation><\/semantics><\/math>) case, this collapses to the clean form <math><semantics><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>s<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\bar{\\alpha}_{s}}<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>s<\/mi><\/msub><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\bar{\\alpha}_{s}}<\/annotation><\/semantics><\/math> displayed above. Crucially, <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0<\/annotation><\/semantics><\/math> does not depend on which <math><semantics><mrow><mi>s<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">s<\/annotation><\/semantics><\/math> we want to jump to, it just depends on the current <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math>. So once the network has produced its best guess of the clean image, the formula can reassemble a sample at <em>any<\/em> lower noise level by simply remixing signal and noise in the proportions appropriate to that level. 
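<\/p>\n\n\n\n<p class=\"\">To make the remixing concrete, here is the deterministic (<math><semantics><mrow><mi>\u03b7<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\eta = 0<\/annotation><\/semantics><\/math>) jump as a minimal Python sketch. The names <code>eps_model<\/code> (the trained noise-prediction network) and <code>alpha_bar<\/code> (the precomputed array of cumulative noise levels) are hypothetical placeholders, not any particular library&#8217;s API:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef ddim_jump(x_t, t, s, eps_model, alpha_bar):\n    # Deterministic DDIM step from noise level t to ANY lower level s.\n    eps = eps_model(x_t, t)  # predicted noise direction at the current level\n    x0_hat = (x_t - np.sqrt(1 - alpha_bar[t]) * eps) \/ np.sqrt(alpha_bar[t])\n    # Remix the same x0_hat and eps in the proportions of the target level s:\n    return np.sqrt(alpha_bar[s]) * x0_hat + np.sqrt(1 - alpha_bar[s]) * eps\n<\/code><\/pre>\n\n\n\n<p class=\"\">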
We can leap directly from <math><semantics><mrow><mi>t<\/mi><mo>=<\/mo><mn>1000<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t = 1000<\/annotation><\/semantics><\/math> to <math><semantics><mrow><mi>t<\/mi><mo>=<\/mo><mn>500<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t = 500<\/annotation><\/semantics><\/math> without visiting any intermediate state. (Song et al. formalise this in Section 4.2 \/ Appendix C.1 of the paper: the same update applies along any sub-sequence <math><semantics><mrow><mi>\u03c4<\/mi><mo>\u2282<\/mo><mo stretchy=\"false\">{<\/mo><mn>1<\/mn><mo separator=\"true\">,<\/mo><mo>\u2026<\/mo><mo separator=\"true\">,<\/mo><mi>T<\/mi><mo stretchy=\"false\">}<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\tau \\subset \\{1,\\dots,T\\}<\/annotation><\/semantics><\/math>, replacing <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\bar{\\alpha}_t, \\bar{\\alpha}_{t-1})<\/annotation><\/semantics><\/math> with <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><msub><mi>\u03c4<\/mi><mi>i<\/mi><\/msub><\/msub><mo separator=\"true\">,<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><msub><mi>\u03c4<\/mi><mrow><mi>i<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\bar{\\alpha}_{\\tau_i}, \\bar{\\alpha}_{\\tau_{i-1}})<\/annotation><\/semantics><\/math>.) Contrast this with the DDPM update, whose right-hand side contains <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mi mathvariant=\"normal\">\/<\/mi><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t\/\\sqrt{1-\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math>. That coefficient encodes the size of a <em>single<\/em> small Markov step: it tells the sampler how much progress to make in one <math><semantics><mrow><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\Delta t = 1<\/annotation><\/semantics><\/math> increment. There is no analogous &#8220;leap to <math><semantics><mrow><mi>s<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">s<\/annotation><\/semantics><\/math>&#8221; knob, because the DDPM step size is hard-coded into the formula itself.<\/p>\n\n\n\n<p class=\"\"><strong>(ii) DDIM is a first-order discretisation of an ODE, not an SDE.<\/strong> In the deterministic <math><semantics><mrow><mi>\u03b7<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\eta = 0<\/annotation><\/semantics><\/math> case there is no fresh noise injection at each step, so the only source of error is the truncation error of the ODE solver, i.e. the discrepancy between the discrete update and the continuous probability-flow trajectory we will discuss in Section 3. Truncation error scales smoothly with step size and accumulates additively. 
SDE discretisations are different: each step injects fresh Gaussian noise whose variance scales with the step size, so larger SDE steps mean noisier samples in a way that no amount of network accuracy can fix. ODE solvers do not suffer from this.<\/p>\n\n\n\n<p class=\"\"><strong>(iii) The probability-flow trajectories tend to be relatively low-curvature.<\/strong> <a href=\"#ref-karras2022\">Karras et al. (2022)<\/a> later showed empirically that the deterministic trajectory from prior to data in diffusion models is approximately linear over much of its length, with most of the curvature concentrated near the data manifold. A nearly-straight trajectory is well-approximated by a few large Euler steps.<\/p>\n\n\n\n<p class=\"\">Together these three points explain the empirical observation that DDIM with 20\u201350 steps matches DDPM with 1000 steps, with no retraining required. DDIM also gives an <em>invertible<\/em> encoding from noise to data, which is the substrate of latent-space interpolation, editing, and exact likelihood evaluation via the change-of-variables formula.<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>Intuition:<\/em> DDIM keeps the same marginals as DDPM but parameterises the reverse process around <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_0<\/annotation><\/semantics><\/math>, the <em>predicted clean image<\/em>, rather than around <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math>. Each step says &#8220;given my best guess of the clean image, where would I be at noise level <math><semantics><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t-1<\/annotation><\/semantics><\/math>?&#8221; This is invariant to step size: the formula is happy to jump to any target noise level. Removing the stochasticity (<math><semantics><mrow><mi>\u03b7<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\eta = 0<\/annotation><\/semantics><\/math>) on top of this turns sampling into a deterministic trajectory you can traverse with bigger strides.<\/p>\n<\/div>\n\n\n\n<div class=\"h-16\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">3. Score-Based Generative Modelling Through SDEs<\/h2>\n\n\n\n<p class=\"\">A few months after DDIM, <strong><a href=\"#ref-song2021b\">Song, Sohl-Dickstein, Kingma, Kumar, Ermon &amp; Poole (2021b)<\/a><\/strong> published the paper that, in retrospect, unified everything that came before. 
Two parallel threads had been developing in the literature:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">The <strong>DDPM thread<\/strong> I described above, descending from variational hierarchical-VAE constructions and trained with the <math><semantics><mrow><mi>\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\epsilon<\/annotation><\/semantics><\/math>-prediction MSE.<\/li>\n\n\n\n<li class=\"\">The <strong>NCSN \/ SMLD thread<\/strong> initiated by <strong><a href=\"#ref-songermon2019\">Song &amp; Ermon (2019)<\/a><\/strong>, &#8220;Noise Conditional Score Network&#8221; and &#8220;Score Matching with Langevin Dynamics&#8221;, which trained a network <math><semantics><mrow><msub><mi mathvariant=\"bold\">s<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2248<\/mo><msub><mi mathvariant=\"normal\">\u2207<\/mi><mi mathvariant=\"bold\">x<\/mi><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>\u03c3<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{s}_\\theta(\\mathbf{x}, \\sigma) \\approx \\nabla_\\mathbf{x}\\log p_\\sigma(\\mathbf{x})<\/annotation><\/semantics><\/math> to estimate the <em>score function<\/em> of a noise-corrupted version of the data distribution at many noise scales <math><semantics><mrow><msub><mi>\u03c3<\/mi><mn>1<\/mn><\/msub><mo>&gt;<\/mo><msub><mi>\u03c3<\/mi><mn>2<\/mn><\/msub><mo>&gt;<\/mo><mo>\u22ef<\/mo><mo>&gt;<\/mo><msub><mi>\u03c3<\/mi><mi>L<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_1 &gt; \\sigma_2 &gt; \\dots &gt; \\sigma_L<\/annotation><\/semantics><\/math>. To sample, NCSN ran <em>annealed Langevin dynamics<\/em>: starting from a wide Gaussian, take some Langevin steps at the largest <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>, anneal to the next smaller <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>, repeat (a code sketch of this sampler appears below).<\/li>\n<\/ul>\n\n\n\n<p class=\"\">The training objective for NCSN was <em>denoising score matching<\/em> (<a href=\"#ref-hyvarinen2005\">Hyv\u00e4rinen, 2005<\/a>; <a href=\"#ref-vincent2011\">Vincent, 2011<\/a>), and the sampler was a discrete-time Langevin chain. <a href=\"#ref-song2021b\">Song et al. (2021b)<\/a> showed that both DDPM and NCSN are discretisations of the <em>same<\/em> continuous-time stochastic differential equation. 
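<\/p>\n\n\n\n<p class=\"\">Here is the annealed Langevin sampler from the NCSN bullet above, as a minimal sketch. <code>score_model<\/code> is a hypothetical stand-in for the trained score network, and the per-scale step size follows Song &amp; Ermon&#8217;s rule <code>eps * sigma_i**2 \/ sigma_L**2<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef annealed_langevin(score_model, sigmas, shape, n_steps=100, eps=2e-5, rng=None):\n    # sigmas: decreasing noise scales sigma_1 &gt; sigma_2 &gt; ... &gt; sigma_L\n    rng = rng or np.random.default_rng()\n    x = sigmas[0] * rng.standard_normal(shape)  # start from a wide Gaussian\n    for sigma in sigmas:\n        step = eps * (sigma \/ sigmas[-1]) ** 2  # per-scale Langevin step size\n        for _ in range(n_steps):\n            z = rng.standard_normal(shape)\n            x = x + 0.5 * step * score_model(x, sigma) + np.sqrt(step) * z\n    return x\n<\/code><\/pre>\n\n\n\n<p class=\"\">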
This unification opened the door to better samplers, exact likelihood evaluation via the probability-flow ODE, and a host of downstream methods including guidance and inverse-problem solvers.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?ssl=1\"><img decoding=\"async\" width=\"1734\" height=\"824\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?fit=625%2C297&amp;ssl=1\" alt=\"\" class=\"wp-image-14289\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?w=1734&amp;ssl=1 1734w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?resize=300%2C143&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?resize=1024%2C487&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?resize=768%2C365&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?resize=1536%2C730&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?resize=624%2C297&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image.png?w=1250&amp;ssl=1 1250w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image_2.png?ssl=1\"><img decoding=\"async\" width=\"1502\" height=\"568\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image_2.png?fit=625%2C236&amp;ssl=1\" alt=\"\" class=\"wp-image-14291\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image_2.png?w=1502&amp;ssl=1 1502w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image_2.png?resize=300%2C113&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image_2.png?resize=1024%2C387&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image_2.png?resize=768%2C290&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image_2.png?resize=624%2C236&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/SDE_image_2.png?w=1250&amp;ssl=1 1250w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Forward SDE: setup and three regimes<\/h3>\n\n\n\n<p class=\"\">The forward corruption is now an <strong>It\u00f4 stochastic differential equation<\/strong> on <math><semantics><mrow><mi>t<\/mi><mo>\u2208<\/mo><mo stretchy=\"false\">[<\/mo><mn>0<\/mn><mo separator=\"true\">,<\/mo><mi>T<\/mi><mo stretchy=\"false\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">t \\in [0, T]<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"bold\">f<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi 
mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">w<\/mi><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x} \\;=\\; \\mathbf{f}(\\mathbf{x},t)\\,\\mathrm{d}t \\;+\\; g(t)\\,\\mathrm{d}\\mathbf{w},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">where <math><semantics><mrow><mi mathvariant=\"bold\">w<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{w}<\/annotation><\/semantics><\/math> is a standard <math><semantics><mrow><mi>d<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">d<\/annotation><\/semantics><\/math>-dimensional Wiener process (Brownian motion), <math><semantics><mrow><mi mathvariant=\"bold\">f<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{f}(\\mathbf{x},t)<\/annotation><\/semantics><\/math> is the <em>drift coefficient<\/em>, and <math><semantics><mrow><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">g(t)<\/annotation><\/semantics><\/math> is the <em>diffusion coefficient<\/em>. Three concrete choices of <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">f<\/mi><mo separator=\"true\">,<\/mo><mi>g<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\mathbf{f}, g)<\/annotation><\/semantics><\/math> recover existing models:<\/p>\n\n\n\n<p class=\"\"><strong>(a) Variance Preserving (VP-SDE)<\/strong>, the continuous-time limit of DDPM:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">x<\/mi><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">w<\/mi><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x} \\;=\\; -\\tfrac{1}{2}\\beta(t)\\,\\mathbf{x}\\,\\mathrm{d}t \\;+\\; \\sqrt{\\beta(t)}\\,\\mathrm{d}\\mathbf{w}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This is an inhomogeneous <strong>Ornstein\u2013Uhlenbeck (OU) process<\/strong>: a stochastic process whose drift is a linear <em>restoring force<\/em> pulling <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math> toward the origin (with strength <math><semantics><mrow><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation 
encoding=\"application\/x-tex\">\\tfrac{1}{2}\\beta(t)<\/annotation><\/semantics><\/math>), and whose diffusion term injects isotropic noise (of strength <math><semantics><mrow><msqrt><mrow><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\beta(t)}<\/annotation><\/semantics><\/math>). The &#8220;pull toward the origin&#8221; can be read directly off the drift coefficient <math><semantics><mrow><mo>\u2212<\/mo><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">-\\tfrac{1}{2}\\beta(t)\\,\\mathbf{x}<\/annotation><\/semantics><\/math>: it always points in the direction <em>opposite<\/em> to the current state <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math>, with magnitude proportional to the distance from the origin. So if <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math> sits far out along some axis, the deterministic part of the dynamics pushes it back toward zero; if <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math> is already near the origin, the drift is small. This is the continuous-time analogue of the discrete forward step <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><mo>\u2026<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\sqrt{1-\\beta_t}\\,\\mathbf{x}_{t-1} + \\dots<\/annotation><\/semantics><\/math> from Section 1, where the factor <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mo>&lt;<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\beta_t} &lt; 1<\/annotation><\/semantics><\/math> was shrinking the mean toward zero at each step. 
The asymptotic stationary distribution as <math><semantics><mrow><mi>t<\/mi><mo>\u2192<\/mo><mi mathvariant=\"normal\">\u221e<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t \\to \\infty<\/annotation><\/semantics><\/math> is <math><semantics><mrow><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{N}(\\mathbf{0}, \\mathbf{I})<\/annotation><\/semantics><\/math>, the Gaussian prior we want.<\/p>\n\n\n\n<p class=\"\"><strong>(b) Variance Exploding (VE-SDE)<\/strong>, the continuous-time limit of NCSN \/ SMLD:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><mtext>\u2009<\/mtext><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><\/mstyle><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">w<\/mi><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x} \\;=\\; \\sqrt{\\tfrac{\\mathrm{d}\\,\\sigma^2(t)}{\\mathrm{d}t}}\\,\\mathrm{d}\\mathbf{w}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Notice: <em>no drift<\/em>. The data is never rescaled; only noise is added, at a rate determined by the schedule <math><semantics><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma(t)<\/annotation><\/semantics><\/math>. The marginal is <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\mathbf{x}_0 + \\sigma(t)\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>, and as <math><semantics><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2192<\/mo><mi mathvariant=\"normal\">\u221e<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma(t) \\to \\infty<\/annotation><\/semantics><\/math> the marginal becomes a very wide Gaussian centred at <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math>: the variance literally explodes. 
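<\/p>\n\n\n\n<p class=\"\">A short simulation makes the two regimes tangible. This is only a sketch: the constant <code>beta<\/code> and the geometric <code>sigma(t)<\/code> below are illustrative choices, not the exact schedules used in the papers:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\nrng = np.random.default_rng(0)\nn, T = 100_000, 1_000\ndt = 1.0 \/ T\nx_vp = 0.5 * rng.standard_normal(n)  # toy 'data' with standard deviation 0.5\nx_ve = x_vp.copy()\nbeta, sigma_max = 20.0, 50.0         # illustrative constants\n\nfor k in range(T):\n    t = k * dt\n    dw = np.sqrt(dt) * rng.standard_normal(n)\n    # VP: the drift pulls toward 0 while noise is injected, pinning the variance at 1\n    x_vp = x_vp - 0.5 * beta * x_vp * dt + np.sqrt(beta) * dw\n    # VE: no drift; g(t)**2 = d sigma**2 \/ dt for sigma(t) = sigma_max**t\n    g2 = 2.0 * np.log(sigma_max) * sigma_max ** (2.0 * t)\n    x_ve = x_ve + np.sqrt(g2) * dw\n\nprint(x_vp.std())  # close to 1: variance preserved\nprint(x_ve.std())  # close to sigma_max = 50: variance exploded\n<\/code><\/pre>\n\n\n\n<p class=\"\">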
The reverse process will therefore look like a &#8220;huge Gaussian collapsing&#8221; into a new generated sample.<\/p>\n\n\n\n<p class=\"\"><strong>(c) sub-VP-SDE<\/strong>, introduced in the same paper:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">x<\/mi><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msup><mi>e<\/mi><mrow><mo>\u2212<\/mo><mn>2<\/mn><mtext>\u2009\u2063<\/mtext><msubsup><mo>\u222b<\/mo><mn>0<\/mn><mi>t<\/mi><\/msubsup><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>s<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi>s<\/mi><\/mrow><\/msup><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">w<\/mi><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x} \\;=\\; -\\tfrac{1}{2}\\beta(t)\\,\\mathbf{x}\\,\\mathrm{d}t \\;+\\; \\sqrt{\\beta(t)\\bigl(1-e^{-2\\!\\int_0^t\\beta(s)\\,\\mathrm{d}s}\\bigr)}\\,\\mathrm{d}\\mathbf{w}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This is a hybrid whose variance is strictly less than that of the VP-SDE at every <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>, empirically giving better likelihoods on CIFAR-10. We will not dwell on it, but it&#8217;s worth knowing it exists.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/VP_vs_VE.png?ssl=1\"><img decoding=\"async\" width=\"922\" height=\"366\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/VP_vs_VE.png?fit=922%2C366&amp;ssl=1\" alt=\"\" class=\"wp-image-14293\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/VP_vs_VE.png?w=922&amp;ssl=1 922w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/VP_vs_VE.png?resize=300%2C119&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/VP_vs_VE.png?resize=768%2C305&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/VP_vs_VE.png?resize=624%2C248&amp;ssl=1 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">DDPM is a discretisation of the VP-SDE<\/h3>\n\n\n\n<p class=\"\">The claim that DDPM is the VP-SDE in disguise is worth working out, because the algebra is short and clarifying. 
Discretise the VP-SDE with the Euler\u2013Maruyama scheme, the simplest first-order numerical scheme for SDEs, using a step <math><semantics><mrow><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\Delta t<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>+<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msub><mo>\u2212<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>\u2248<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t+\\Delta t} &#8211; \\mathbf{x}_t \\;\\approx\\; -\\tfrac{1}{2}\\beta(t)\\,\\mathbf{x}_t\\,\\Delta t \\;+\\; \\sqrt{\\beta(t)\\,\\Delta t}\\,\\boldsymbol{\\epsilon},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">so<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>+<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>\u2248<\/mo><mtext>\u2005\u200a<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t+\\Delta t} \\;\\approx\\; \\bigl(1 &#8211; \\tfrac{1}{2}\\beta(t)\\,\\Delta t\\bigr)\\,\\mathbf{x}_t \\;+\\; \\sqrt{\\beta(t)\\,\\Delta t}\\,\\boldsymbol{\\epsilon}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Now identify the <em>discrete<\/em> DDPM coefficients with the <em>continuous<\/em> SDE coefficients via <math><semantics><mrow><msubsup><mi>\u03b2<\/mi><mi>t<\/mi><mtext>disc<\/mtext><\/msubsup><mo>=<\/mo><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><annotation 
encoding=\"application\/x-tex\">\\beta_t^{\\text{disc}} = \\beta(t)\\,\\Delta t<\/annotation><\/semantics><\/math> (i.e. the discrete <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t<\/annotation><\/semantics><\/math> in DDPM is <math><semantics><mrow><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\beta(t)<\/annotation><\/semantics><\/math> times the time-step). For small <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t<\/annotation><\/semantics><\/math> we have the Taylor expansion <math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mo>\u2248<\/mo><mn>1<\/mn><mo>\u2212<\/mo><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1 &#8211; \\beta_t} \\approx 1 &#8211; \\tfrac{1}{2}\\beta_t<\/annotation><\/semantics><\/math>, so<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><msubsup><mi>\u03b2<\/mi><mi>t<\/mi><mtext>disc<\/mtext><\/msubsup><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>\u2248<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msubsup><mi>\u03b2<\/mi><mi>t<\/mi><mtext>disc<\/mtext><\/msubsup><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\bigl(1 &#8211; \\tfrac{1}{2}\\beta_t^{\\text{disc}}\\bigr)\\,\\mathbf{x}_t \\;\\approx\\; \\sqrt{1 &#8211; \\beta_t^{\\text{disc}}}\\,\\mathbf{x}_t.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The Euler\u2013Maruyama discretisation thus reads<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>+<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>\u2248<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msubsup><mi>\u03b2<\/mi><mi>t<\/mi><mtext>disc<\/mtext><\/msubsup><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><msubsup><mi>\u03b2<\/mi><mi>t<\/mi><mtext>disc<\/mtext><\/msubsup><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t+\\Delta t} \\;\\approx\\; \\sqrt{1 &#8211; \\beta_t^{\\text{disc}}}\\,\\mathbf{x}_t \\;+\\; \\sqrt{\\beta_t^{\\text{disc}}}\\,\\boldsymbol{\\epsilon},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">which is <strong>exactly<\/strong> the DDPM forward step <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msqrt><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi 
mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><msqrt><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\sqrt{\\alpha_t}\\,\\mathbf{x}_{t-1} + \\sqrt{\\beta_t}\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> from Section 1. So DDPM is <em>literally<\/em> the Euler\u2013Maruyama discretisation of the VP-SDE with <math><semantics><mrow><msubsup><mi>\u03b2<\/mi><mi>t<\/mi><mtext>disc<\/mtext><\/msubsup><mo>=<\/mo><mi>\u03b2<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t^{\\text{disc}} = \\beta(t)\\,\\Delta t<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why is VP-SDE &#8220;variance preserving&#8221;?<\/h3>\n\n\n\n<p class=\"\">Take the scalar case for clarity and compute the conditional variance recursively. From the forward step <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><msqrt><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\sqrt{1-\\beta_t}\\,\\mathbf{x}_{t-1} + \\sqrt{\\beta_t}\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> with <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn>0<\/mn><mo separator=\"true\">,<\/mo><mn>1<\/mn><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}\\sim\\mathcal{N}(0,1)<\/annotation><\/semantics><\/math> independent of <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{t-1}<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_t) \\;=\\; (1-\\beta_t)\\,\\mathrm{Var}(\\mathbf{x}_{t-1}) \\;+\\; \\beta_t.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">If 
<math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_{t-1}) = 1<\/annotation><\/semantics><\/math>, then <math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>\u22c5<\/mo><mn>1<\/mn><mo>+<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_t) = (1-\\beta_t)\\cdot 1 + \\beta_t = 1<\/annotation><\/semantics><\/math>. So once the variance reaches one, it stays exactly one, independently of the schedule <math><semantics><mrow><mo stretchy=\"false\">{<\/mo><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">}<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\{\\beta_t\\}<\/annotation><\/semantics><\/math>. This is the precise sense in which the VP-SDE <em>preserves<\/em> variance: the marginal variance is a fixed point of the forward dynamics at value 1. The forward process keeps every <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> in roughly the same magnitude range as the data, which is convenient numerically.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How VE-SDE is the continuous limit of NCSN\/SMLD<\/h3>\n\n\n\n<p class=\"\">Now to the variance-exploding side. NCSN\/SMLD as originally constructed (<a href=\"#ref-songermon2019\">Song &amp; Ermon, 2019<\/a>) trains a network at <math><semantics><mrow><mi>L<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">L<\/annotation><\/semantics><\/math> <em>discrete<\/em> noise scales <math><semantics><mrow><msub><mi>\u03c3<\/mi><mn>1<\/mn><\/msub><mo>&gt;<\/mo><msub><mi>\u03c3<\/mi><mn>2<\/mn><\/msub><mo>&gt;<\/mo><mo>\u22ef<\/mo><mo>&gt;<\/mo><msub><mi>\u03c3<\/mi><mi>L<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_1 &gt; \\sigma_2 &gt; \\dots &gt; \\sigma_L<\/annotation><\/semantics><\/math>: at scale <math><semantics><mrow><msub><mi>\u03c3<\/mi><mi>i<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_i<\/annotation><\/semantics><\/math> the &#8220;data&#8221; is <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>+<\/mo><msub><mi>\u03c3<\/mi><mi>i<\/mi><\/msub><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} + \\sigma_i\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>, with <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math> drawn from the dataset. 
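<\/p>\n\n\n\n<p class=\"\">In code, one training pair at scale <code>sigma_i<\/code> looks like the following sketch; NCSN then regresses the network output onto <code>target<\/code> with a <code>sigma_i**2<\/code>-weighted MSE, averaged over scales:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef dsm_pair(x, sigma_i, rng):\n    # One denoising-score-matching training pair at noise scale sigma_i.\n    eps = rng.standard_normal(x.shape)\n    x_noisy = x + sigma_i * eps\n    # Regression target: the score of the Gaussian corruption kernel,\n    # grad log N(x_noisy; x, sigma_i**2 I) = -(x_noisy - x) \/ sigma_i**2 = -eps \/ sigma_i\n    target = -eps \/ sigma_i\n    return x_noisy, target\n<\/code><\/pre>\n\n\n\n<p class=\"\">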
The network outputs an estimate of the score <math><semantics><mrow><msub><mi mathvariant=\"normal\">\u2207<\/mi><mi mathvariant=\"bold\">x<\/mi><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><msub><mi>\u03c3<\/mi><mi>i<\/mi><\/msub><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla_\\mathbf{x}\\log p_{\\sigma_i}(\\mathbf{x})<\/annotation><\/semantics><\/math> at each scale, where <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03c3<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_\\sigma<\/annotation><\/semantics><\/math> denotes the data distribution convolved with <math><semantics><mrow><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{N}(\\mathbf{0}, \\sigma^2\\mathbf{I})<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<p class=\"\">To pass to the continuous-time version, let the discrete scales become a continuous schedule <math><semantics><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma(t)<\/annotation><\/semantics><\/math> and ask: which SDE produces marginals <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\mathbf{x}_0 + \\sigma(t)\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> where <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}\\sim\\mathcal{N}(\\mathbf{0},\\mathbf{I})<\/annotation><\/semantics><\/math>?<\/p>\n\n\n\n<p class=\"\">Since the marginal mean is <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> (constant in <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>), the drift must be zero: <math><semantics><mrow><mi mathvariant=\"bold\">f<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{f}(\\mathbf{x},t) = 0<\/annotation><\/semantics><\/math>. 
The marginal variance is <math><semantics><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma^2(t)<\/annotation><\/semantics><\/math>, and for an SDE with zero drift and diffusion <math><semantics><mrow><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">g(t)<\/annotation><\/semantics><\/math>, the variance evolves as<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mfrac><mi mathvariant=\"normal\">d<\/mi><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\frac{\\mathrm{d}}{\\mathrm{d}t}\\mathrm{Var}(\\mathbf{x}_t) \\;=\\; g(t)^2.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Setting <math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_t) = \\sigma^2(t)<\/annotation><\/semantics><\/math> gives <math><semantics><mrow><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><mo>=<\/mo><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><\/mrow><annotation encoding=\"application\/x-tex\">g(t)^2 = \\tfrac{\\mathrm{d}\\sigma^2(t)}{\\mathrm{d}t}<\/annotation><\/semantics><\/math>, hence<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><mtext>\u2009<\/mtext><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><\/msqrt><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">g(t) \\;=\\; \\sqrt{\\frac{\\mathrm{d}\\,\\sigma^2(t)}{\\mathrm{d}t}}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This is exactly the VE-SDE coefficient. 
The corresponding SDE is<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><\/mstyle><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">w<\/mi><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x} \\;=\\; \\sqrt{\\tfrac{\\mathrm{d}\\sigma^2(t)}{\\mathrm{d}t}}\\,\\mathrm{d}\\mathbf{w},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">with marginal <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\mathbf{x}_0 + \\sigma(t)\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>, exactly the family of noise-corrupted distributions that NCSN trains on. So the VE-SDE is the continuous-schedule limit of NCSN&#8217;s discrete noise-scale construction: just as the VP-SDE is the continuous-time limit of DDPM&#8217;s forward chain, the VE-SDE is the continuous-time limit of NCSN&#8217;s noise ladder.<\/p>\n\n\n\n<p class=\"\">The term &#8220;variance exploding&#8221; refers to the divergence of <math><semantics><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma^2(t)<\/annotation><\/semantics><\/math> at large <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>: in standard NCSN setups <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> follows a geometric progression with <math><semantics><mrow><msub><mi>\u03c3<\/mi><mi>max<\/mi><mo>\u2061<\/mo><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\max}<\/annotation><\/semantics><\/math> in the dozens or hundreds, so the marginal <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> at late time has variance much greater than 1. In contrast, VP keeps the variance bounded at 1.<\/p>
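<p class=\"\">A quick numerical sanity check makes this concrete: integrating the VE-SDE with Euler\u2013Maruyama should reproduce the closed-form marginal standard deviation <code>sigma(t)<\/code>. A minimal sketch (my own illustration, not from any of the papers; the geometric schedule and all constants are arbitrary choices):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\nrng = np.random.default_rng(0)\n\n# Hypothetical geometric schedule on t in [0, 1]\nsigma_min, sigma_max = 0.01, 50.0\n\ndef sigma(t):\n    return sigma_min * (sigma_max \/ sigma_min) ** t\n\nn_paths, n_steps = 100_000, 1_000\ndt = 1.0 \/ n_steps\nx0 = rng.standard_normal(n_paths)  # stand-in for the data\nx = x0.copy()\nt = 0.0\nfor _ in range(n_steps):\n    # g(t)^2 dt = d(sigma^2), approximated over one step\n    g2_dt = sigma(t + dt) ** 2 - sigma(t) ** 2\n    x += np.sqrt(g2_dt) * rng.standard_normal(n_paths)  # pure diffusion, zero drift\n    t += dt\n\nprint(np.std(x - x0), sigma(1.0))  # the two numbers agree closely<\/code><\/pre>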
<div class=\"bg-muted\/50 p-6 rounded-lg my-6 border border-border\">\n<h3 id=\"numerical-example-vp-vs-ve-at-matched-snr\">Numerical example (VP vs VE at matched SNR)<\/h3>\n<p>The cleanest way to see the geometric difference between VP and VE is to match them at the same <em>signal-to-noise ratio<\/em> (SNR). For each process, with <math><semantics><mrow><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\text{data}} = 1<\/annotation><\/semantics><\/math>:<\/p>\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mrow><mi mathvariant=\"normal\">S<\/mi><mi mathvariant=\"normal\">N<\/mi><mi mathvariant=\"normal\">R<\/mi><\/mrow><mtext>VP<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><msub><mrow><mi mathvariant=\"normal\">S<\/mi><mi mathvariant=\"normal\">N<\/mi><mi mathvariant=\"normal\">R<\/mi><\/mrow><mtext>VE<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mn>1<\/mn><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{SNR}_{\\text{VP}}(t) \\;=\\; \\frac{\\bar{\\alpha}_t}{1-\\bar{\\alpha}_t},\\qquad \\mathrm{SNR}_{\\text{VE}}(\\sigma) \\;=\\; \\frac{1}{\\sigma^2}.<\/annotation><\/semantics><\/math><\/div>\n<p>Setting these equal gives the conversion <math><semantics><mrow><msub><mi>\u03c3<\/mi><mtext>VE<\/mtext><\/msub><mo>=<\/mo><msqrt><mrow><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\text{VE}} = \\sqrt{(1-\\bar{\\alpha})\/\\bar{\\alpha}}<\/annotation><\/semantics><\/math>. 
Picking three noise levels and applying the same <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>=<\/mo><mn>1.0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0 = 1.0<\/annotation><\/semantics><\/math>, <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo>=<\/mo><mn>0.5<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon} = 0.5<\/annotation><\/semantics><\/math>:<\/p>\n<table><thead><tr><th align=\"right\">Noise level (SNR)<\/th><th align=\"right\"><math><semantics><mrow><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}<\/annotation><\/semantics><\/math><\/th><th align=\"right\"><math><semantics><mrow><msub><mi>\u03c3<\/mi><mtext>VE<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\text{VE}}<\/annotation><\/semantics><\/math><\/th><th align=\"right\"><math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> under VP<\/th><th align=\"right\"><math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> under VE<\/th><th align=\"right\"><math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_t\\mid \\mathbf{x}_0)<\/annotation><\/semantics><\/math> VP<\/th><th align=\"right\"><math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_t\\mid \\mathbf{x}_0)<\/annotation><\/semantics><\/math> VE<\/th><\/tr><\/thead><tbody><tr><td align=\"right\">Low (SNR=9)<\/td><td align=\"right\">0.9<\/td><td align=\"right\">0.333<\/td><td align=\"right\"><math><semantics><mrow><msqrt><mn>0.9<\/mn><\/msqrt><mo>\u22c5<\/mo><mn>1<\/mn><mo>+<\/mo><msqrt><mn>0.1<\/mn><\/msqrt><mo>\u22c5<\/mo><mn>0.5<\/mn><mo>=<\/mo><mn>1.107<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{0.9}\\cdot 1 + \\sqrt{0.1}\\cdot 0.5 = 1.107<\/annotation><\/semantics><\/math><\/td><td align=\"right\"><math><semantics><mrow><mn>1<\/mn><mo>+<\/mo><mn>0.333<\/mn><mo>\u22c5<\/mo><mn>0.5<\/mn><mo>=<\/mo><mn>1.167<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">1 + 0.333\\cdot 0.5 = 1.167<\/annotation><\/semantics><\/math><\/td><td align=\"right\">0.10<\/td><td align=\"right\">0.111<\/td><\/tr><tr><td align=\"right\">Mid (SNR=1)<\/td><td align=\"right\">0.5<\/td><td align=\"right\">1.000<\/td><td align=\"right\"><math><semantics><mrow><msqrt><mn>0.5<\/mn><\/msqrt><mo>\u22c5<\/mo><mn>1<\/mn><mo>+<\/mo><msqrt><mn>0.5<\/mn><\/msqrt><mo>\u22c5<\/mo><mn>0.5<\/mn><mo>=<\/mo><mn>1.061<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{0.5}\\cdot 1 + \\sqrt{0.5}\\cdot 0.5 = 
1.061<\/annotation><\/semantics><\/math><\/td><td align=\"right\"><math><semantics><mrow><mn>1<\/mn><mo>+<\/mo><mn>1.000<\/mn><mo>\u22c5<\/mo><mn>0.5<\/mn><mo>=<\/mo><mn>1.500<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">1 + 1.000\\cdot 0.5 = 1.500<\/annotation><\/semantics><\/math><\/td><td align=\"right\">0.50<\/td><td align=\"right\">1.000<\/td><\/tr><tr><td align=\"right\">High (SNR=1\/9)<\/td><td align=\"right\">0.1<\/td><td align=\"right\">3.000<\/td><td align=\"right\"><math><semantics><mrow><msqrt><mn>0.1<\/mn><\/msqrt><mo>\u22c5<\/mo><mn>1<\/mn><mo>+<\/mo><msqrt><mn>0.9<\/mn><\/msqrt><mo>\u22c5<\/mo><mn>0.5<\/mn><mo>=<\/mo><mn>0.791<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{0.1}\\cdot 1 + \\sqrt{0.9}\\cdot 0.5 = 0.791<\/annotation><\/semantics><\/math><\/td><td align=\"right\"><math><semantics><mrow><mn>1<\/mn><mo>+<\/mo><mn>3.000<\/mn><mo>\u22c5<\/mo><mn>0.5<\/mn><mo>=<\/mo><mn>2.500<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">1 + 3.000\\cdot 0.5 = 2.500<\/annotation><\/semantics><\/math><\/td><td align=\"right\">0.90<\/td><td align=\"right\">9.000<\/td><\/tr><\/tbody><\/table>\n<p>Reading the table left-to-right: at every row, VP and VE encode the <em>same amount of information<\/em> about <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> (same SNR), but they live on completely different scales. VP keeps the magnitude of <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> bounded near 1 across all noise levels. VE lets the magnitude grow without bound: at high noise the data is a vanishing perturbation of a wide Gaussian. Same <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>, same information content, drastically different geometry.<\/p>\n<\/div>
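<p class=\"\">Every number in the table can be recomputed in a few lines, a useful check when adapting the matched-SNR conversion to your own schedules. A minimal sketch (my own check, using only the formulas above):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\nx0, eps = 1.0, 0.5  # same signal and same noise draw for both processes\nfor abar in (0.9, 0.5, 0.1):\n    snr = abar \/ (1 - abar)                # SNR_VP\n    sigma_ve = np.sqrt((1 - abar) \/ abar)  # matched VE level, SNR_VE = 1\/sigma^2\n    x_vp = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps\n    x_ve = x0 + sigma_ve * eps\n    print(f'SNR={snr:.3f} sigma_VE={sigma_ve:.3f} x_VP={x_vp:.3f} '\n          f'x_VE={x_ve:.3f} Var_VP={1 - abar:.2f} Var_VE={sigma_ve ** 2:.3f}')<\/code><\/pre>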
<h3 class=\"wp-block-heading\">Reverse-time SDE and the score function<\/h3>\n\n\n\n<p class=\"\">The fundamental theoretical result of <a href=\"#ref-song2021b\">Song et al. (2021b)<\/a> is the reversibility of the forward SDE. By a classical result of <strong><a href=\"#ref-anderson1982\">Anderson (1982)<\/a><\/strong>, every It\u00f4 diffusion admits a <em>reverse-time<\/em> companion whose drift is the original drift minus the squared diffusion coefficient times the score of the time-marginal:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">[<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">f<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mo>\u2212<\/mo><mtext>\u2009<\/mtext><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><mtext>\u2009<\/mtext><msub><mi mathvariant=\"normal\">\u2207<\/mi><mrow><mtext>\u2009\u2063<\/mtext><mi mathvariant=\"bold\">x<\/mi><\/mrow><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">]<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mover accent=\"true\"><mi mathvariant=\"bold\">w<\/mi><mo>\u02c9<\/mo><\/mover><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x} \\;=\\; \\bigl[\\,\\mathbf{f}(\\mathbf{x},t) \\,-\\, g(t)^2\\,\\nabla_{\\!\\mathbf{x}}\\log p_t(\\mathbf{x})\\,\\bigr]\\,\\mathrm{d}t \\;+\\; g(t)\\,\\mathrm{d}\\bar{\\mathbf{w}},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">where <math><semantics><mrow><mover accent=\"true\"><mi mathvariant=\"bold\">w<\/mi><mo>\u02c9<\/mo><\/mover><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\mathbf{w}}<\/annotation><\/semantics><\/math> is a <em>reverse-time<\/em> Brownian motion.<\/p>
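<p class=\"\">Integrating this equation numerically needs nothing beyond Euler\u2013Maruyama and a score estimate, which the next paragraph supplies. A minimal sketch (illustrative only: <code>score<\/code>, <code>f<\/code> and <code>g<\/code> are callables you supply, and the uniform time grid is a deliberately naive choice):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef reverse_sde_sample(score, f, g, x_T, T=1.0, n_steps=1000, rng=None):\n    # Integrate the reverse-time SDE from t = T down to t = 0 (Euler-Maruyama)\n    rng = rng or np.random.default_rng()\n    x = x_T.copy()\n    dt = T \/ n_steps\n    for i in range(n_steps):\n        t = T - i * dt\n        drift = f(x, t) - g(t) ** 2 * score(x, t)  # Anderson reverse drift\n        x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)\n    return x<\/code><\/pre>\n\n\n\n<p class=\"\">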
If we can estimate the score <math><semantics><mrow><msub><mi mathvariant=\"normal\">\u2207<\/mi><mi mathvariant=\"bold\">x<\/mi><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla_\\mathbf{x}\\log p_t(\\mathbf{x})<\/annotation><\/semantics><\/math> at every <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> (call this estimate <math><semantics><mrow><msub><mi mathvariant=\"bold\">s<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{s}_\\theta(\\mathbf{x},t)<\/annotation><\/semantics><\/math>), we can simulate the reverse-time SDE numerically (Euler\u2013Maruyama again) and produce a sample from <math><semantics><mrow><msub><mi>p<\/mi><mn>0<\/mn><\/msub><mo>\u2248<\/mo><msub><mi>p<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_0 \\approx p_{\\text{data}}<\/annotation><\/semantics><\/math> starting from <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo>\u223c<\/mo><msub><mi>p<\/mi><mi>T<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_T\\sim p_T<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<p class=\"\">The score is learned by <strong>denoising score matching<\/strong> (<a href=\"#ref-vincent2011\">Vincent, 2011<\/a>). The loss is<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"script\">L<\/mi><mtext>DSM<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03b8<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi mathvariant=\"double-struck\">E<\/mi><mrow><mi>t<\/mi><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><\/msub><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">[<\/mo><mi>\u03bb<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">\u2225<\/mo><msub><mi mathvariant=\"bold\">s<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2212<\/mo><msub><mi mathvariant=\"normal\">\u2207<\/mi><mrow><mtext>\u2009\u2063<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><msup><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">\u2225<\/mo><mn>2<\/mn><\/msup><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">]<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathcal{L}_{\\text{DSM}}(\\theta) \\;=\\; 
\\mathbb{E}_{t,\\,\\mathbf{x}_0,\\,\\boldsymbol{\\epsilon}}\\Bigl[\\lambda(t)\\,\\bigl\\lVert\\mathbf{s}_\\theta(\\mathbf{x}_t,t) &#8211; \\nabla_{\\!\\mathbf{x}_t}\\log p_t(\\mathbf{x}_t\\mid\\mathbf{x}_0)\\bigr\\rVert^2\\Bigr].<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Because <math><semantics><mrow><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_t(\\mathbf{x}_t\\mid\\mathbf{x}_0)<\/annotation><\/semantics><\/math> is a Gaussian with closed form, the conditional score is analytic. For VP at time <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>, <math><semantics><mrow><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p_t(\\mathbf{x}_t\\mid\\mathbf{x}_0) = \\mathcal{N}(\\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0, (1-\\bar{\\alpha}_t)\\mathbf{I})<\/annotation><\/semantics><\/math>, so<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"normal\">\u2207<\/mi><mrow><mtext>\u2009\u2063<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2223<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mtext>\u2009<\/mtext><mfrac><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>\u2212<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/mfrac><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mtext>\u2009<\/mtext><mfrac><mi mathvariant=\"bold-italic\">\u03f5<\/mi><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla_{\\!\\mathbf{x}_t}\\log p_t(\\mathbf{x}_t\\mid\\mathbf{x}_0) \\;=\\; -\\,\\frac{\\mathbf{x}_t &#8211; \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0}{1-\\bar{\\alpha}_t} \\;=\\; 
-\\,\\frac{\\boldsymbol{\\epsilon}}{\\sqrt{1-\\bar{\\alpha}_t}}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This is the exact mathematical bridge to DDPM: predicting the noise <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> and predicting the score <math><semantics><mrow><msub><mi mathvariant=\"normal\">\u2207<\/mi><mrow><mtext>\u2009\u2063<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla_{\\!\\mathbf{x}_t}\\log p_t<\/annotation><\/semantics><\/math> are the <em>same regression problem<\/em> up to a fixed scalar <math><semantics><mrow><mo>\u2212<\/mo><mn>1<\/mn><mi mathvariant=\"normal\">\/<\/mi><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">-1\/\\sqrt{1-\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math>. The DDPM simple loss and the NCSN denoising score matching loss differ only in the constant in front of the MSE, which is absorbed into the weighting function <math><semantics><mrow><mi>\u03bb<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\lambda(t)<\/annotation><\/semantics><\/math>.<\/p>
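<p class=\"\">In code, the bridge is one line in each direction. A sketch under the VP convention above (hypothetical helper names; <code>eps_pred<\/code> stands for whatever your noise-prediction network outputs):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef score_from_eps(eps_pred, abar_t):\n    # score = -eps \/ sqrt(1 - abar_t): the fixed scalar relating the two targets\n    return -eps_pred \/ np.sqrt(1.0 - abar_t)\n\ndef x0_from_eps(x_t, eps_pred, abar_t):\n    # invert x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps\n    return (x_t - np.sqrt(1.0 - abar_t) * eps_pred) \/ np.sqrt(abar_t)<\/code><\/pre>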
<h3 class=\"wp-block-heading\">Probability-flow ODE: same marginals, no noise<\/h3>\n\n\n\n<p class=\"\">A deep observation from <a href=\"#ref-song2021b\">Song et al. (2021b)<\/a>: the reverse SDE is <em>not<\/em> the only stochastic process with the marginals <math><semantics><mrow><msub><mi>p<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_t<\/annotation><\/semantics><\/math>; a whole family of processes shares them. There is a privileged deterministic member, the <strong>probability-flow ODE<\/strong>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"bold\">f<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mtext>\u2005\u200a<\/mtext><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mtext>\u2009<\/mtext><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><mtext>\u2009<\/mtext><msub><mi mathvariant=\"normal\">\u2207<\/mi><mrow><mtext>\u2009\u2063<\/mtext><mi mathvariant=\"bold\">x<\/mi><\/mrow><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\frac{\\mathrm{d}\\mathbf{x}}{\\mathrm{d}t} \\;=\\; \\mathbf{f}(\\mathbf{x},t) \\;-\\; \\tfrac{1}{2}\\,g(t)^2\\,\\nabla_{\\!\\mathbf{x}}\\log p_t(\\mathbf{x}).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Solving this ODE <em>backward in time<\/em> from <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>T<\/mi><\/msub><mo>\u223c<\/mo><msub><mi>p<\/mi><mi>T<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_T\\sim p_T<\/annotation><\/semantics><\/math> to <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> produces a sample with the same marginal distribution at every <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> as the reverse SDE, but along a <em>deterministic, invertible trajectory<\/em>. 
The probability-flow ODE:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">underlies DDIM, which is (up to a notational change) its first-order Euler discretisation in the VP regime;<\/li>\n\n\n\n<li class=\"\">enables exact log-likelihoods through the instantaneous change-of-variables formula <math><semantics><mrow><msub><mi mathvariant=\"normal\">\u2202<\/mi><mi>t<\/mi><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u2207<\/mi><mo>\u22c5<\/mo><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mi mathvariant=\"normal\">\/<\/mi><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\partial_t \\log p_t(\\mathbf{x}_t) = -\\nabla\\cdot\\bigl(\\mathrm{d}\\mathbf{x}\/\\mathrm{d}t\\bigr)<\/annotation><\/semantics><\/math>;<\/li>\n\n\n\n<li class=\"\">is the substrate on which EDM will later build.<\/li>\n<\/ul>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>Intuition (score as force).<\/em> The score <math><semantics><mrow><msub><mi mathvariant=\"normal\">\u2207<\/mi><mi mathvariant=\"bold\">x<\/mi><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla_\\mathbf{x}\\log p_t(\\mathbf{x})<\/annotation><\/semantics><\/math> is the gradient of the log-density: it points <em>uphill<\/em> in <math><semantics><mrow><msub><mi>p<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_t<\/annotation><\/semantics><\/math>, toward higher-density regions, i.e. toward the data manifold. In the reverse SDE, the term <math><semantics><mrow><mo>\u2212<\/mo><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u2207<\/mi><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">-g(t)^2\\,\\nabla\\log p_t<\/annotation><\/semantics><\/math> in the drift is therefore an <em>attractive force<\/em> pulling the sample toward the data manifold. The Brownian term <math><semantics><mrow><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mover accent=\"true\"><mi mathvariant=\"bold\">w<\/mi><mo>\u02c9<\/mo><\/mover><\/mrow><annotation encoding=\"application\/x-tex\">g(t)\\,\\mathrm{d}\\bar{\\mathbf{w}}<\/annotation><\/semantics><\/math> is <em>exploratory<\/em>: it injects random fluctuations that prevent the sample from collapsing onto any particular trajectory. The balance between these two, attraction toward data versus stochastic exploration, is exactly the <em>exploration\u2013exploitation<\/em> tradeoff familiar from Langevin MCMC and from many sampling algorithms in statistical physics. 
The probability-flow ODE keeps the attraction and drops the exploration entirely, giving a deterministic trajectory that is the &#8220;expected&#8221; reverse path.<\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Predictor\u2013corrector sampling and Langevin correctors<\/h3>\n\n\n\n<p class=\"\">A practical contribution of <a href=\"#ref-song2021b\">Song et al. (2021b)<\/a> that I find under-appreciated is <strong>predictor\u2013corrector (PC) sampling<\/strong>. The motivation is that any numerical SDE solver makes errors at each step: the discretised sample at time <math><semantics><mrow><mi>t<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t-\\Delta t<\/annotation><\/semantics><\/math> has a distribution that drifts from the true marginal <math><semantics><mrow><msub><mi>p<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{t-\\Delta t}<\/annotation><\/semantics><\/math>. Over many steps these errors compound. The idea of PC is to <em>correct<\/em> the sample&#8217;s distribution back onto the true <math><semantics><mrow><msub><mi>p<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{t-\\Delta t}<\/annotation><\/semantics><\/math> after each predictor step by running a small number of MCMC steps targeting <math><semantics><mrow><msub><mi>p<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{t-\\Delta t}<\/annotation><\/semantics><\/math>, using the same learnt score we already have.<\/p>\n\n\n\n<p class=\"\">The corrector of choice is <strong>Langevin MCMC<\/strong> (a.k.a. unadjusted Langevin algorithm or ULA). To sample from a density <math><semantics><mrow><mi>p<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">p<\/annotation><\/semantics><\/math>, Langevin dynamics is the SDE<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mi mathvariant=\"normal\">\u2207<\/mi><mtext>\u2009\u2063<\/mtext><mi>log<\/mi><mo>\u2061<\/mo><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">w<\/mi><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x} \\;=\\; \\tfrac{1}{2}\\nabla\\!\\log p(\\mathbf{x})\\,\\mathrm{d}t \\;+\\; \\mathrm{d}\\mathbf{w}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Its stationary distribution is <math><semantics><mrow><mi>p<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">p<\/annotation><\/semantics><\/math>. 
Discretising with the Euler\u2013Maruyama scheme and a step size <math><semantics><mrow><mi>\u03b4<\/mi><mo>&gt;<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\delta &gt; 0<\/annotation><\/semantics><\/math>,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>\u2190<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"bold\">x<\/mi><mo>+<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mi>\u03b4<\/mi><mn>2<\/mn><\/mfrac><\/mstyle><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u2207<\/mi><mtext>\u2009\u2063<\/mtext><mi>log<\/mi><mo>\u2061<\/mo><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msqrt><mi>\u03b4<\/mi><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">z<\/mi><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi mathvariant=\"bold\">z<\/mi><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} \\;\\leftarrow\\; \\mathbf{x} + \\tfrac{\\delta}{2}\\,\\nabla\\!\\log p(\\mathbf{x}) \\;+\\; \\sqrt{\\delta}\\,\\mathbf{z},\\qquad \\mathbf{z}\\sim\\mathcal{N}(\\mathbf{0},\\mathbf{I}).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Here <strong><math><semantics><mrow><mi>\u03b4<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\delta<\/annotation><\/semantics><\/math> is the Langevin step size<\/strong>, a hyperparameter that controls how aggressively the corrector moves toward higher-density regions per Langevin step. (It is <em>not<\/em> the SDE timestep <math><semantics><mrow><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\Delta t<\/annotation><\/semantics><\/math>, it is the discretisation parameter of a <em>different<\/em> SDE that runs at <em>fixed<\/em> outer time.) A typical PC sampling iteration looks like:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\"><strong>Predictor<\/strong>: One reverse-SDE step from <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math> at time <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> to <math><semantics><mrow><msup><mi mathvariant=\"bold\">x<\/mi><mtext>pred<\/mtext><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}^{\\text{pred}}<\/annotation><\/semantics><\/math> at time <math><semantics><mrow><mi>t<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t &#8211; \\Delta t<\/annotation><\/semantics><\/math> (e.g. 
Euler\u2013Maruyama with the learnt score).<\/li>\n\n\n\n<li class=\"\"><strong>Corrector<\/strong>: <math><semantics><mrow><mi>K<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">K<\/annotation><\/semantics><\/math> Langevin steps at fixed time <math><semantics><mrow><mi>t<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t &#8211; \\Delta t<\/annotation><\/semantics><\/math> targeting <math><semantics><mrow><msub><mi>p<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{t-\\Delta t}<\/annotation><\/semantics><\/math>:<br><math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>\u2190<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>+<\/mo><mfrac><mi>\u03b4<\/mi><mn>2<\/mn><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">s<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><mi>t<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>+<\/mo><msqrt><mi>\u03b4<\/mi><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">z<\/mi><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} \\leftarrow \\mathbf{x} + \\tfrac{\\delta}{2}\\,\\mathbf{s}_\\theta(\\mathbf{x},\\,t-\\Delta t) + \\sqrt{\\delta}\\,\\mathbf{z}.<\/annotation><\/semantics><\/math><\/li>\n<\/ol>\n\n\n\n<p class=\"\">The number of Langevin steps <math><semantics><mrow><mi>K<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">K<\/annotation><\/semantics><\/math> and the step size <math><semantics><mrow><mi>\u03b4<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\delta<\/annotation><\/semantics><\/math> are tuned per problem; <a href=\"#ref-song2021b\">Song et al. (2021b)<\/a> typically use <math><semantics><mrow><mi>K<\/mi><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">K = 1<\/annotation><\/semantics><\/math> and a small <math><semantics><mrow><mi>\u03b4<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\delta<\/annotation><\/semantics><\/math>, choosing it so that the &#8220;signal-to-noise ratio&#8221; of the Langevin update matches a target value. 
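<\/p>\n\n\n\n<p class=\"\">Put together, one PC iteration is only a few lines. A sketch (illustrative: <code>score<\/code>, <code>f<\/code> and <code>g<\/code> are callables you supply, and a real implementation would set <code>delta<\/code> adaptively from the target signal-to-noise ratio rather than fixing it):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef pc_step(x, t, dt, score, f, g, K=1, delta=1e-4, rng=None):\n    rng = rng or np.random.default_rng()\n    # Predictor: one reverse-time Euler-Maruyama step from t to t - dt\n    drift = f(x, t) - g(t) ** 2 * score(x, t)\n    x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)\n    # Corrector: K unadjusted Langevin steps targeting the marginal at t - dt\n    for _ in range(K):\n        z = rng.standard_normal(x.shape)\n        x = x + 0.5 * delta * score(x, t - dt) + np.sqrt(delta) * z\n    return x<\/code><\/pre>\n\n\n\n<p class=\"\">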
PC samplers gave a noticeable quality boost over plain Euler\u2013Maruyama at the same total number of function evaluations, and the principle, <em>interleave deterministic transport with score-driven stochastic correction<\/em>, reappears in EDM&#8217;s stochastic sampler.<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>Why this looks familiar to anyone who has run a molecular dynamics simulation.<\/em> Langevin dynamics is <em>exactly<\/em> the equation of motion used in MD with a stochastic thermostat: under the identification <math><semantics><mrow><mi>log<\/mi><mo>\u2061<\/mo><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2194<\/mo><mo>\u2212<\/mo><mi>U<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><msub><mi>k<\/mi><mi>B<\/mi><\/msub><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\log p(\\mathbf{x}) \\leftrightarrow -U(\\mathbf{x})\/k_B T<\/annotation><\/semantics><\/math> (with <math><semantics><mrow><mi>U<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">U<\/annotation><\/semantics><\/math> a potential energy and <math><semantics><mrow><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">T<\/annotation><\/semantics><\/math> a temperature), the term <math><semantics><mrow><mi mathvariant=\"normal\">\u2207<\/mi><mtext>\u2009\u2063<\/mtext><mi>log<\/mi><mo>\u2061<\/mo><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla\\!\\log p(\\mathbf{x})<\/annotation><\/semantics><\/math> becomes the negative gradient of a potential, i.e. a <em>force<\/em>. So the Langevin step<\/p>\n<p><math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>\u2190<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"bold\">x<\/mi><mo>+<\/mo><mfrac><mi>\u03b4<\/mi><mn>2<\/mn><\/mfrac><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u2207<\/mi><mtext>\u2009\u2063<\/mtext><mi>log<\/mi><mo>\u2061<\/mo><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mo>+<\/mo><msqrt><mi>\u03b4<\/mi><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">z<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} \\;\\leftarrow\\; \\mathbf{x} + \\tfrac{\\delta}{2}\\,\\nabla\\!\\log p(\\mathbf{x}) + \\sqrt{\\delta}\\,\\mathbf{z}<\/annotation><\/semantics><\/math><\/p>\n<p>reads, in MD language, as &#8220;take a small step in the direction of the force, and add a thermal kick.&#8221; Just as MD with a Langevin thermostat samples the Boltzmann distribution <math><semantics><mrow><msup><mi>e<\/mi><mrow><mo>\u2212<\/mo><mi>U<\/mi><mi mathvariant=\"normal\">\/<\/mi><msub><mi>k<\/mi><mi>B<\/mi><\/msub><mi>T<\/mi><\/mrow><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">e^{-U\/k_B T}<\/annotation><\/semantics><\/math> at long times, our corrector samples the target distribution <math><semantics><mrow><mi>p<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">p<\/annotation><\/semantics><\/math> at long times. 
The score <math><semantics><mrow><msub><mi mathvariant=\"bold\">s<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{s}_\\theta(\\mathbf{x},t)<\/annotation><\/semantics><\/math> produced by the neural network is, by analogy, a <em>learnt force field<\/em>: a position-dependent vector field whose flow lines lead the sampler toward high-density regions of <math><semantics><mrow><msub><mi>p<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_t<\/annotation><\/semantics><\/math>. The thermal noise <math><semantics><mrow><msqrt><mi>\u03b4<\/mi><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">z<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\delta}\\,\\mathbf{z}<\/annotation><\/semantics><\/math> prevents the sample from collapsing into a single energy minimum (a single mode of <math><semantics><mrow><mi>p<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">p<\/annotation><\/semantics><\/math>), exactly as it does in MD at finite temperature. This is a useful mental picture when you want to reason about ergodicity, mode mixing, or step-size choices in the sampler: most of the intuition built up in computational statistical mechanics transfers directly.<\/p>\n<\/div>\n\n\n\n<div class=\"h-16\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">4. EDM: Elucidating (and Simplifying) the Design Space<\/h2>\n\n\n\n<p class=\"\">By 2022 the diffusion literature had accumulated a small zoo of choices (VP vs VE, <math><semantics><mrow><mi>\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\epsilon<\/annotation><\/semantics><\/math>- vs score- vs <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math>-prediction, linear vs cosine schedule, ancestral vs DDIM vs PC) and it was not clear which combinations mattered for sample quality and which were essentially cosmetic. <strong><a href=\"#ref-karras2022\">Karras, Aittala, Aila &amp; Laine (2022)<\/a><\/strong>, in &#8220;Elucidating the Design Space of Diffusion-Based Generative Models&#8221; (EDM), attacked this head-on by re-deriving everything in a single, opinionated parameterisation and then ablating each design choice independently. The result is what has become the default in modern image-diffusion and protein-diffusion codebases.<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>Source.<\/em> The exposition of the EDM sampler in this section follows closely the <a href=\"https:\/\/www.youtube.com\/watch?v=T0Qxzf0eaio&amp;list=PPSV&amp;t=1430s\">excellent talk by the EDM authors<\/a> on the design space of diffusion samplers. I highly recommend watching it in full: it is the clearest single source I know of for the geometric intuition behind the <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>-parameterisation, preconditioning, and Heun-plus-churn sampling.<\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">The &#8220;x-ray view&#8221;: decoupling the design choices<\/h3>\n\n\n\n<p class=\"\">The framing the EDM authors themselves use is worth absorbing. 
Before EDM, methods like VP, VE, iDDPM, and DDIM looked like <em>tightly coupled packages<\/em> you had to take as a whole: you couldn&#8217;t, say, combine the iDDPM noise schedule with the DDIM sampler without something breaking, because each package&#8217;s choices were implicitly tangled. EDM gives an <em>x-ray view<\/em> into the internals: it shows that any such method decomposes into a small set of <em>orthogonal<\/em> design choices,<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">the <strong>ODE \/ SDE solver<\/strong> (Euler, Heun, RK45, \u2026),<\/li>\n\n\n\n<li class=\"\">the <strong>time-step discretisation<\/strong> (where to place the <math><semantics><mrow><mi>N<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">N<\/annotation><\/semantics><\/math> sampling points along <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>),<\/li>\n\n\n\n<li class=\"\">the <strong>noise schedule<\/strong> <math><semantics><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma(t)<\/annotation><\/semantics><\/math> (how <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> grows with the abstract time variable),<\/li>\n\n\n\n<li class=\"\">the <strong>signal scaling<\/strong> <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">s(t)<\/annotation><\/semantics><\/math> (whether and how <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> is rescaled),<\/li>\n\n\n\n<li class=\"\">the <strong>network preconditioning<\/strong> (<math><semantics><mrow><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>c<\/mi><mtext>out<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>c<\/mi><mtext>in<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>c<\/mi><mtext>noise<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{skip}}, c_{\\text{out}}, c_{\\text{in}}, c_{\\text{noise}}<\/annotation><\/semantics><\/math>),<\/li>\n\n\n\n<li class=\"\">the <strong>training noise distribution<\/strong> <math><semantics><mrow><mi>p<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">p(\\sigma)<\/annotation><\/semantics><\/math>,<\/li>\n\n\n\n<li class=\"\">the <strong>loss weighting<\/strong> <math><semantics><mrow><mi>\u03bb<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\lambda(\\sigma)<\/annotation><\/semantics><\/math>,<\/li>\n<\/ul>\n\n\n\n<p class=\"\">each of which can be tuned in isolation, <em>without<\/em> retraining the network and without affecting the validity of the other choices. 
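<\/p>\n\n\n\n<p class=\"\">To make one of these axes concrete, the sketch below implements the EDM time-step discretisation, which spaces the sampling points so that <code>sigma**(1\/rho)<\/code> is uniform (formula and image-model defaults reproduced from the EDM paper; double-check against the original before reusing the constants):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):\n    # Uniform grid in sigma**(1\/rho): clusters steps at low noise,\n    # where the sampling trajectory curves the most\n    ramp = np.linspace(0.0, 1.0, n)\n    s = (sigma_max ** (1 \/ rho)\n         + ramp * (sigma_min ** (1 \/ rho) - sigma_max ** (1 \/ rho))) ** rho\n    return np.append(s, 0.0)  # final step lands exactly on sigma = 0<\/code><\/pre>\n\n\n\n<p class=\"\">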
Armed with this decomposition, <a href=\"#ref-karras2022\">Karras et al.<\/a> systematically searched for the best individual setting of each axis and stacked the improvements.<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>Sampling and training can be mixed and matched freely.<\/em> An immediately useful practical consequence of the decoupling above: the <em>training-time<\/em> choices (preconditioning, loss weighting, noise distribution) and the <em>sampling-time<\/em> choices (solver, time-step schedule, churn) live on independent axes. You can take a network trained with the iDDPM noise distribution and weighting, and run it with the EDM Heun sampler at inference, no retraining required. Indeed, much of the empirical force of the EDM paper comes from precisely this experiment: <a href=\"#ref-karras2022\">Karras et al.<\/a> apply their improved sampler to off-the-shelf pre-trained VP, VE, and iDDPM models and obtain large quality gains <em>without touching the weights<\/em>. The corollary is that sampler design and training design can be iterated on independently, a huge practical win, because re-training a diffusion model is expensive while changing the sampler is essentially free.<\/p>\n<\/div>\n\n\n\n<p class=\"\">Before going through their choices it helps to name the two distinct error sources in sampling, which we want to analyse in isolation:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\"><strong>Network (denoiser) error<\/strong>: even if numerical integration were exact, the denoiser <math><semantics><mrow><msub><mi>D<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">D_\\theta<\/annotation><\/semantics><\/math> is only an approximation of the optimal denoiser, so it pushes the trajectory in a slightly wrong direction at each step.<\/li>\n\n\n\n<li class=\"\"><strong>Truncation error<\/strong>: the deterministic trajectory we want to integrate is a curve in <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\mathbf{x},\\sigma)<\/annotation><\/semantics><\/math>-space, and any solver discretises it with straight segments. Larger steps mean larger curvature-induced deviations from the true curve.<\/li>\n<\/ol>\n\n\n\n<p class=\"\">These two sources interact (worse network \u2192 larger drift \u2192 harder for the solver), but the <em>sampler<\/em> design controls primarily the second, while the <em>training<\/em> design controls primarily the first. EDM separates them cleanly.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>-parameterisation<\/h3>\n\n\n\n<p class=\"\">EDM throws out the discrete time index <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> and works directly with the noise level <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>. 
The forward path is the simplest possible:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><mi>\u03c3<\/mi><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo>\u223c<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}(\\sigma) \\;=\\; \\mathbf{x}_0 \\;+\\; \\sigma\\,\\boldsymbol{\\epsilon},\\qquad \\boldsymbol{\\epsilon}\\sim\\mathcal{N}(\\mathbf{0},\\mathbf{I}),<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">i.e. a variance-exploding path with <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> playing both the role of &#8220;time&#8221; and &#8220;noise scale.&#8221; <a href=\"#ref-karras2022\">Karras et al.<\/a> pick <math><semantics><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma(t) = t<\/annotation><\/semantics><\/math> so that <math><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi>\u03c3<\/mi><mo>=<\/mo><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\sigma = \\mathrm{d}t<\/annotation><\/semantics><\/math> and no signal-scaling <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo mathvariant=\"normal\">\u2260<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">s(t)\\ne 1<\/annotation><\/semantics><\/math> is applied. Their justification, which I think is the single deepest insight of the paper, is given in the next subsection.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Two ways to deal with the signal-magnitude problem<\/h3>\n\n\n\n<p class=\"\">A practical concern with the VE path is that for large <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>, the magnitude of <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} = \\mathbf{x}_0 + \\sigma\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> becomes huge, values in the tens or hundreds, well outside the comfortable unit-scale range that neural networks like to operate in. 
Naively feeding <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math> at <math><semantics><mrow><mi>\u03c3<\/mi><mo>=<\/mo><mn>80<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma = 80<\/annotation><\/semantics><\/math> to a CNN is asking for unstable training.<\/p>\n\n\n\n<p class=\"\">There are exactly two principled cures for this, and they correspond to the two camps in the pre-EDM literature:<\/p>\n\n\n\n<p class=\"\"><strong>Cure 1 (VP-style): rescale the signal inside the forward path.<\/strong> Introduce an explicit <em>signal scaling<\/em> <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">s(t)<\/annotation><\/semantics><\/math> so that the actual sample fed to the network is <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">s(t)\\mathbf{x}(t)<\/annotation><\/semantics><\/math>, with <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">s(t)<\/annotation><\/semantics><\/math> chosen to keep the magnitude of <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">s(t)\\mathbf{x}(t)<\/annotation><\/semantics><\/math> bounded. 
This is what VP does implicitly: setting <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\sqrt{\\bar{\\alpha}_t}\\,\\mathbf{x}_0 + \\sqrt{1-\\bar{\\alpha}_t}\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> is exactly a rescaled version of the VE path, with <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">s(t) = \\sqrt{\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math> squeezing the signal down to keep total variance at 1.<\/p>\n\n\n\n<p class=\"\"><strong>Cure 2 (VE + network preconditioning): leave the path alone, rescale at the network&#8217;s input and output.<\/strong> Keep the simple VE path <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}(\\sigma) = \\mathbf{x}_0 + \\sigma\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> in its raw form, but wrap the neural network so that its <em>input<\/em> is divided by <math><semantics><mrow><msqrt><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\sigma^2 + \\sigma_{\\text{data}}^2}<\/annotation><\/semantics><\/math> (giving unit-variance input) and its <em>output<\/em> is multiplied by an appropriate <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>-dependent factor. From the network&#8217;s point of view it always sees well-scaled tensors, even though the underlying ODE state <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math> may have huge magnitude.<\/p>\n\n\n\n<p class=\"\">Both cures solve the same numerical problem and let the network see well-scaled tensors throughout training and sampling. But they have very different geometric consequences for the underlying ODE trajectory. The VP rescaling <em>distorts<\/em> the natural flow lines of the ODE: it bends what would be a near-linear trajectory in <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math>-space into a curved one. The VE-plus-preconditioning approach leaves the ODE alone (the flow lines retain whatever natural geometry they have) and absorbs the multi-scale problem inside the network. 
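<\/p>\n\n\n\n<p class=\"\">In code, Cure 2 is a thin wrapper around the raw network. A sketch (the coefficient formulas and the <code>sigma_data<\/code> default are the published EDM choices, not something derived in this post; <code>raw_net<\/code> is whatever backbone you train):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef edm_denoiser(raw_net, x, sigma, sigma_data=0.5):\n    # The raw network F_theta always sees unit-variance input; the wrapper\n    # reassembles a denoiser D(x; sigma) around it\n    c_in = 1.0 \/ np.sqrt(sigma ** 2 + sigma_data ** 2)\n    c_skip = sigma_data ** 2 \/ (sigma ** 2 + sigma_data ** 2)\n    c_out = sigma * sigma_data \/ np.sqrt(sigma ** 2 + sigma_data ** 2)\n    c_noise = 0.25 * np.log(sigma)\n    return c_skip * x + c_out * raw_net(c_in * x, c_noise)<\/code><\/pre>\n\n\n\n<p class=\"\">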
EDM argues, and demonstrates empirically, that the second cure is strictly better: straighter flow lines mean fewer Euler steps are needed for the same quality. This is why EDM commits to <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">s(t) = 1<\/annotation><\/semantics><\/math> (no signal scaling in the ODE) and pushes all the magnitude management into preconditioning.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Deriving the EDM ODE from the forward path<\/h3>\n\n\n\n<p class=\"\">The probability-flow ODE for a general SDE was given in Section 3. For the VE-SDE with <math><semantics><mrow><mi mathvariant=\"bold\">f<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{f} = 0<\/annotation><\/semantics><\/math> and <math><semantics><mrow><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><msqrt><mrow><mi mathvariant=\"normal\">d<\/mi><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">g(t) = \\sqrt{\\mathrm{d}\\sigma^2(t)\/\\mathrm{d}t}<\/annotation><\/semantics><\/math> it becomes<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mtext>\u2009<\/mtext><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mtext>\u2009<\/mtext><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><mtext>\u2009<\/mtext><msub><mi mathvariant=\"normal\">\u2207<\/mi><mrow><mtext>\u2009\u2063<\/mtext><mi mathvariant=\"bold\">x<\/mi><\/mrow><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\frac{\\mathrm{d}\\mathbf{x}}{\\mathrm{d}t} \\;=\\; -\\,\\tfrac{1}{2}\\,\\frac{\\mathrm{d}\\sigma^2(t)}{\\mathrm{d}t}\\,\\nabla_{\\!\\mathbf{x}}\\log p_t(\\mathbf{x}).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Now invoke <strong>Tweedie&#8217;s formula<\/strong>: for <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} = \\mathbf{x}_0 + \\sigma\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> with <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>\u223c<\/mo><msub><mi>p<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0\\sim p_{\\text{data}}<\/annotation><\/semantics><\/math>,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"double-struck\">E<\/mi><mo fence=\"true\" 
maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">[<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2009<\/mtext><mo fence=\"false\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">\u2223<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">x<\/mi><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">]<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mtext>\u2009<\/mtext><msub><mi mathvariant=\"normal\">\u2207<\/mi><mrow><mtext>\u2009\u2063<\/mtext><mi mathvariant=\"bold\">x<\/mi><\/mrow><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>\u03c3<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbb{E}\\bigl[\\mathbf{x}_0\\,\\big|\\,\\mathbf{x}\\bigr] \\;=\\; \\mathbf{x} \\;+\\; \\sigma^2\\,\\nabla_{\\!\\mathbf{x}}\\log p_\\sigma(\\mathbf{x}),<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">equivalently <math><semantics><mrow><msub><mi mathvariant=\"normal\">\u2207<\/mi><mrow><mtext>\u2009\u2063<\/mtext><mi mathvariant=\"bold\">x<\/mi><\/mrow><\/msub><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>\u03c3<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2212<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla_{\\!\\mathbf{x}}\\log p_\\sigma(\\mathbf{x}) = (D(\\mathbf{x};\\sigma) &#8211; \\mathbf{x})\/\\sigma^2<\/annotation><\/semantics><\/math>, where <math><semantics><mrow><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mo>:<\/mo><mo>=<\/mo><mi mathvariant=\"double-struck\">E<\/mi><mo stretchy=\"false\">[<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>\u2223<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">D(\\mathbf{x};\\sigma) := \\mathbb{E}[\\mathbf{x}_0\\mid\\mathbf{x}]<\/annotation><\/semantics><\/math> is the <strong>optimal MMSE denoiser<\/strong>. 
<em>MMSE<\/em> stands for <strong>minimum mean squared error<\/strong>: among all measurable functions <math><semantics><mrow><mi>f<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">f(\\mathbf{x})<\/annotation><\/semantics><\/math> predicting <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> from a noisy observation <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}<\/annotation><\/semantics><\/math>, the conditional expectation <math><semantics><mrow><mi mathvariant=\"double-struck\">E<\/mi><mo stretchy=\"false\">[<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>\u2223<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbb{E}[\\mathbf{x}_0\\mid\\mathbf{x}]<\/annotation><\/semantics><\/math> is the unique minimiser of the expected squared error <math><semantics><mrow><mi mathvariant=\"double-struck\">E<\/mi><mo stretchy=\"false\">\u2225<\/mo><mi>f<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2212<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><msup><mo stretchy=\"false\">\u2225<\/mo><mn>2<\/mn><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbb{E}\\lVert f(\\mathbf{x}) &#8211; \\mathbf{x}_0\\rVert^2<\/annotation><\/semantics><\/math>. So <math><semantics><mrow><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">D(\\mathbf{x};\\sigma)<\/annotation><\/semantics><\/math> is precisely the object that a network trained with an MSE loss would converge to in the infinite-data limit: predict the clean image given the noisy one. 
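<\/p>\n\n\n\n<p class=\"\">The two views can be checked against each other numerically. For an empirical (finite) dataset, both the MMSE denoiser and the score of the Gaussian-smoothed density have closed forms, and the identity holds exactly, not just in expectation. A minimal 1-D NumPy sketch of my own:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\nrng   = np.random.default_rng(0)\ndata  = rng.normal(0.0, 0.5, 1000)    # a 1-D empirical dataset of x0 samples\nsigma = 2.0\nx     = 3.1                           # an arbitrary noisy observation\n\n# posterior weights: w_i proportional to N(x; x0_i, sigma^2)\nw = np.exp(-0.5 * ((x - data) \/ sigma) ** 2)\nw \/= w.sum()\n\nD     = np.sum(w * data)                     # MMSE denoiser E[x0 | x]\nscore = np.sum(w * (data - x)) \/ sigma**2    # grad_x log p_sigma(x)\n\nprint(np.allclose(D, x + sigma**2 * score))  # True: Tweedie holds exactly<\/code><\/pre>\n\n\n\n<p class=\"\">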
Tweedie&#8217;s formula tells us this denoiser and the score are two views of the same underlying quantity.<\/p>\n\n\n\n<p class=\"\">Substituting the Tweedie expression and using <math><semantics><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma(t) = t<\/annotation><\/semantics><\/math> so that <math><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mi mathvariant=\"normal\">\/<\/mi><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mo>=<\/mo><mn>2<\/mn><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\sigma^2\/\\mathrm{d}t = 2t<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><\/mrow><\/mfrac><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2212<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mo>\u22c5<\/mo><mn>2<\/mn><mi>t<\/mi><mo>\u22c5<\/mo><mfrac><mrow><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2212<\/mo><mi mathvariant=\"bold\">x<\/mi><\/mrow><msup><mi>t<\/mi><mn>2<\/mn><\/msup><\/mfrac><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mi>t<\/mi><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\frac{\\mathrm{d}\\mathbf{x}}{\\mathrm{d}t} \\;=\\; -\\tfrac{1}{2}\\cdot 2t\\cdot\\frac{D(\\mathbf{x};t) &#8211; \\mathbf{x}}{t^2} \\;=\\; \\frac{\\mathbf{x} &#8211; D(\\mathbf{x};t)}{t}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This is the <strong>EDM ODE<\/strong>, the most economical form one can write for the deterministic generative dynamics. The sign convention deserves a careful word, because at first glance the formula looks like it points the wrong way.<\/p>\n\n\n\n<p class=\"\">First, a clarification that is worth being explicit about. The forward <em>noising<\/em> introduced at the start of this section, <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}(\\sigma) = \\mathbf{x}_0 + \\sigma\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>, is the marginal of the <strong>stochastic<\/strong> VE-SDE <math><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mo>=<\/mo><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">w<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x} = g(t)\\,\\mathrm{d}\\mathbf{w}<\/annotation><\/semantics><\/math>. 
The equation <math><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mi mathvariant=\"normal\">\/<\/mi><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x}\/\\mathrm{d}t = (\\mathbf{x}-D)\/t<\/annotation><\/semantics><\/math> we just derived is a <strong>different<\/strong> equation: it is the probability-flow ODE, a <em>deterministic<\/em> dynamical system whose only relation to the noising SDE is that it shares the same marginal distributions <math><semantics><mrow><msub><mi>p<\/mi><mi>\u03c3<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_\\sigma<\/annotation><\/semantics><\/math> at each noise level. They are not the same trajectory and not the same equation; they only agree on the one-time marginals.<\/p>\n\n\n\n<p class=\"\">The defining property of the PF-ODE, stated explicitly by <a href=\"#ref-karras2022\">Karras et al. (2022)<\/a> (Section 2), is that integrating it from <math><semantics><mrow><msub><mi>t<\/mi><mi>a<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">t_a<\/annotation><\/semantics><\/math> to <math><semantics><mrow><msub><mi>t<\/mi><mi>b<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">t_b<\/annotation><\/semantics><\/math> \u2014 <em>either forward or backward in <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math><\/em> \u2014 produces a sample distributed according to the marginal <math><semantics><mrow><msub><mi>p<\/mi><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi>t<\/mi><mi>b<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">p_{\\sigma(t_b)}<\/annotation><\/semantics><\/math>. Because the PF-ODE is deterministic, its solutions can be traversed in either direction of <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>, and Karras et al. give the intuition in two complementary sentences: &#8220;an infinitesimal <em>forward<\/em> step of this ODE nudges the sample away from the data&#8230; equivalently, a <em>backward<\/em> step nudges the sample towards the data distribution.&#8221; So the <em>same<\/em> PF-ODE can be used either to corrupt (integrate with <math><semantics><mrow><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mo>&gt;<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\Delta t &gt; 0<\/annotation><\/semantics><\/math>, toward larger noise) or to generate (integrate with <math><semantics><mrow><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mo>&lt;<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\Delta t &lt; 0<\/annotation><\/semantics><\/math>, toward smaller noise).<\/p>\n\n\n\n<p class=\"\">Now interpret the formula in each direction. 
Read with <em>increasing<\/em> <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> (increasing noise level <math><semantics><mrow><mi>\u03c3<\/mi><mo>=<\/mo><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma = t<\/annotation><\/semantics><\/math>), the velocity vector <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">(\\mathbf{x}-D)\/t<\/annotation><\/semantics><\/math> points in the direction <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} &#8211; D<\/annotation><\/semantics><\/math>, i.e. <em>away from<\/em> the denoiser&#8217;s estimate of the clean image \u2014 consistent with the corruption story (sample drifts away from the data manifold as <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> grows). To generate samples we integrate with <math><semantics><mrow><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mo>&lt;<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\Delta t &lt; 0<\/annotation><\/semantics><\/math>, so a single Euler step is <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>\u2190<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>+<\/mo><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mo>\u22c5<\/mo><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mi>t<\/mi><mo>=<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi mathvariant=\"normal\">\u2223<\/mi><mi mathvariant=\"normal\">\u0394<\/mi><mi>t<\/mi><mi mathvariant=\"normal\">\u2223<\/mi><mo>\u22c5<\/mo><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} \\leftarrow \\mathbf{x} + \\Delta t \\cdot (\\mathbf{x}-D)\/t = \\mathbf{x} &#8211; |\\Delta t|\\cdot(\\mathbf{x}-D)\/t<\/annotation><\/semantics><\/math>, which is a step in the direction <math><semantics><mrow><mi>D<\/mi><mo>\u2212<\/mo><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">D &#8211; \\mathbf{x}<\/annotation><\/semantics><\/math>, <em>toward<\/em> the current best guess of the clean image. The sampler does the intuitive thing.<\/p>\n\n\n\n<p class=\"\">The apparent singularity at <math><semantics><mrow><mi>t<\/mi><mo>\u2192<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t \\to 0<\/annotation><\/semantics><\/math> also deserves a closer look. 
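<\/p>\n\n\n\n<p class=\"\">(Before that, here is the intuitive thing in code form: a sketch of a single Euler step, assuming a denoiser callable <code>D(x, t)<\/code> that stands in for the trained network.)<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def euler_step(x, t, t_next, D):\n    # tangent of the EDM ODE: points away from the current denoised guess\n    d = (x - D(x, t)) \/ t\n    # during generation t_next is below t, so this moves x toward D(x, t)\n    return x + (t_next - t) * d\n\n# Extrapolating one step all the way to t_next = 0 gives\n#   x + (0 - t) * (x - D(x, t)) \/ t  ==  D(x, t),\n# i.e. the tangent line passes through the denoiser output (more on this below).<\/code><\/pre>\n\n\n\n<p class=\"\">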
The numerator <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} &#8211; D(\\mathbf{x};t)<\/annotation><\/semantics><\/math> and the denominator <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> both vanish: as <math><semantics><mrow><mi>t<\/mi><mo>\u2192<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t \\to 0<\/annotation><\/semantics><\/math> the optimal denoiser becomes the identity (<math><semantics><mrow><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mn>0<\/mn><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">D(\\mathbf{x};0) = \\mathbf{x}<\/annotation><\/semantics><\/math>), so <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2192<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} &#8211; D(\\mathbf{x};t) \\to 0<\/annotation><\/semantics><\/math>. The two zeros race each other and the resulting limit is finite. To see this concretely, expand <math><semantics><mrow><mi>D<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">D<\/annotation><\/semantics><\/math> to first order in <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>: for small <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>, <math><semantics><mrow><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2248<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><msup><mi>t<\/mi><mn>2<\/mn><\/msup><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u2207<\/mi><mtext>\u2009\u2063<\/mtext><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">D(\\mathbf{x};t) \\approx \\mathbf{x} &#8211; t^2\\,\\nabla\\!\\log p_t(\\mathbf{x})<\/annotation><\/semantics><\/math> (this is just Tweedie&#8217;s formula again, rearranged), so<\/p>\n\n\n\n<p class=\"\"><math><semantics><mrow><mfrac><mrow><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mi>t<\/mi><\/mfrac><mtext>\u2005\u200a<\/mtext><mo>\u2248<\/mo><mtext>\u2005\u200a<\/mtext><mi>t<\/mi><mtext>\u2009<\/mtext><mi mathvariant=\"normal\">\u2207<\/mi><mtext>\u2009\u2063<\/mtext><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><munderover><mo minsize=\"3.0em\" stretchy=\"true\">\u2192<\/mo><mpadded lspace=\"0.3em\" width=\"+0.6em\"><mrow><mi>t<\/mi><mo>\u2192<\/mo><mn>0<\/mn><\/mrow><\/mpadded><mpadded lspace=\"0.3em\" 
width=\"+0.6em\"><mrow><\/mrow><\/mpadded><\/munderover><mtext>\u2005\u200a<\/mtext><mn>0.<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\frac{\\mathbf{x} &#8211; D(\\mathbf{x};t)}{t} \\;\\approx\\; t\\,\\nabla\\!\\log p_t(\\mathbf{x}) \\;\\xrightarrow[t \\to 0]{}\\; 0.<\/annotation><\/semantics><\/math><\/p>\n\n\n\n<p class=\"\">The velocity therefore <em>vanishes<\/em> as <math><semantics><mrow><mi>t<\/mi><mo>\u2192<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t \\to 0<\/annotation><\/semantics><\/math>, and the trajectory smoothly settles onto the data manifold rather than blowing up.<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>The tangent points directly to the denoiser output: why this matters.<\/em> Suppose at time <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math> we extrapolate a single linear Euler step from <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}(t)<\/annotation><\/semantics><\/math> all the way down to <math><semantics><mrow><mi>t<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">t = 0<\/annotation><\/semantics><\/math>. Plugging into <math><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mi mathvariant=\"normal\">\/<\/mi><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x}\/\\mathrm{d}t = (\\mathbf{x} &#8211; D)\/t<\/annotation><\/semantics><\/math>,<\/p>\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mn>0<\/mn><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u2248<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>+<\/mo><mo stretchy=\"false\">(<\/mo><mn>0<\/mn><mo>\u2212<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u22c5<\/mo><mfrac><mrow><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><mi>t<\/mi><\/mfrac><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">;<\/mo><mtext>\u2009<\/mtext><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}(0) \\;\\approx\\; \\mathbf{x}(t) + (0 &#8211; t)\\cdot\\frac{\\mathbf{x}(t) &#8211; D(\\mathbf{x}(t);t)}{t} \\;=\\; D(\\mathbf{x}(t);\\,t).<\/annotation><\/semantics><\/math><\/div>\n<p>So with the <math><semantics><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mi>t<\/mi><\/mrow><annotation 
encoding=\"application\/x-tex\">\\sigma(t)=t<\/annotation><\/semantics><\/math> schedule, the tangent direction <em>points exactly to the current denoiser output<\/em>. <strong>This is not true for other schedules<\/strong>: for VP, or for any choice with signal-rescaling <math><semantics><mrow><mi>s<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo mathvariant=\"normal\">\u2260<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">s(t) \\ne 1<\/annotation><\/semantics><\/math>, the tangent points to some weird intermediate target that depends on the schedule&#8217;s curvature. Empirically, the denoiser output <math><semantics><mrow><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">D(\\mathbf{x}(t);t)<\/annotation><\/semantics><\/math>, the model&#8217;s best guess of the clean image, <em>changes very slowly<\/em> along the sampling trajectory. Once you have an approximately right guess of the clean image, that guess remains approximately right as you anneal <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> downward. So the tangent direction is roughly constant across steps, which is exactly the condition under which large Euler steps work well. This is the deepest answer to &#8220;why does DDIM-style ODE sampling take such big strides?&#8221;: the dynamics really want this schedule.<\/p>\n<\/div>\n\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1076\" style=\"aspect-ratio: 1920 \/ 1076;\" width=\"1920\" controls src=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Clip_Linear_noise_schedule.mp4\"><\/video><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Preconditioning: deriving the EDM coefficients<\/h3>\n\n\n\n<p class=\"\">The single most empirically impactful contribution of EDM is the explicit preconditioning of the network. 
Rather than asking a raw network <math><semantics><mrow><msub><mi>F<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">F_\\theta<\/annotation><\/semantics><\/math> to output <math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> or <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> across many orders of magnitude of <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>, <a href=\"#ref-karras2022\">Karras et al.<\/a> write<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>D<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>+<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi>c<\/mi><mtext>out<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><msub><mi>F<\/mi><mi>\u03b8<\/mi><\/msub><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><msub><mi>c<\/mi><mtext>in<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msub><mi>c<\/mi><mtext>noise<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">D_\\theta(\\mathbf{x};\\sigma) \\;=\\; c_{\\text{skip}}(\\sigma)\\,\\mathbf{x} \\;+\\; c_{\\text{out}}(\\sigma)\\,F_\\theta\\bigl(c_{\\text{in}}(\\sigma)\\,\\mathbf{x},\\,c_{\\text{noise}}(\\sigma)\\bigr),<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">and <em>derive<\/em> the coefficients <math><semantics><mrow><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>c<\/mi><mtext>out<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>c<\/mi><mtext>in<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>c<\/mi><mtext>noise<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{skip}}, c_{\\text{out}}, c_{\\text{in}}, c_{\\text{noise}}<\/annotation><\/semantics><\/math> analytically from two simple desiderata: the network&#8217;s <em>input<\/em> should have unit variance, and the network&#8217;s <em>training target<\/em> should have unit variance. 
Let me walk through the derivation.<\/p>\n\n\n\n<p class=\"\">Assume <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> has variance <math><semantics><mrow><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\text{data}}^2<\/annotation><\/semantics><\/math> per coordinate (typically <math><semantics><mrow><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><mo>=<\/mo><mn>0.5<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\text{data}} = 0.5<\/annotation><\/semantics><\/math> for normalised images). Then <math><semantics><mrow><mi mathvariant=\"bold\">x<\/mi><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x} = \\mathbf{x}_0 + \\sigma\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> has variance <math><semantics><mrow><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><mo>+<\/mo><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\text{data}}^2 + \\sigma^2<\/annotation><\/semantics><\/math>. For the network&#8217;s input <math><semantics><mrow><msub><mi>c<\/mi><mtext>in<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"bold\">x<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{in}}(\\sigma)\\mathbf{x}<\/annotation><\/semantics><\/math> to have unit variance,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>c<\/mi><mtext>in<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mn>1<\/mn><msqrt><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{in}}(\\sigma) \\;=\\; \\frac{1}{\\sqrt{\\sigma^2 + \\sigma_{\\text{data}}^2}}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The training loss is <math><semantics><mrow><mi mathvariant=\"double-struck\">E<\/mi><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">[<\/mo><mi>\u03bb<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mo stretchy=\"false\">\u2225<\/mo><msub><mi>D<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2212<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><msup><mo stretchy=\"false\">\u2225<\/mo><mn>2<\/mn><\/msup><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbb{E}\\bigl[\\lambda(\\sigma)\\,\\lVert D_\\theta(\\mathbf{x};\\sigma) &#8211; \\mathbf{x}_0\\rVert^2\\bigr]<\/annotation><\/semantics><\/math>. 
Substituting the preconditioned form and rearranging, the network&#8217;s <em>effective target<\/em> is<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msubsup><mi>F<\/mi><mi>\u03b8<\/mi><mrow><mtext>\u2009<\/mtext><mo>\u22c6<\/mo><\/mrow><\/msubsup><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>\u2212<\/mo><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><mi mathvariant=\"bold\">x<\/mi><\/mrow><mrow><msub><mi>c<\/mi><mtext>out<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">F_\\theta^{\\,\\star}(\\sigma) \\;=\\; \\frac{\\mathbf{x}_0 &#8211; c_{\\text{skip}}(\\sigma)\\,\\mathbf{x}}{c_{\\text{out}}(\\sigma)}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">We want <math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msubsup><mi>F<\/mi><mi>\u03b8<\/mi><mrow><mtext>\u2009<\/mtext><mo>\u22c6<\/mo><\/mrow><\/msubsup><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(F_\\theta^{\\,\\star}) = 1<\/annotation><\/semantics><\/math>. Expanding the numerator,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>\u2212<\/mo><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mi mathvariant=\"bold\">x<\/mi><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>\u2212<\/mo><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><mi>\u03c3<\/mi><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mtext>\u2009<\/mtext><mo>\u2212<\/mo><mtext>\u2009<\/mtext><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mtext>\u2009<\/mtext><mi>\u03c3<\/mi><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0 &#8211; c_{\\text{skip}}\\mathbf{x} \\;=\\; \\mathbf{x}_0 &#8211; c_{\\text{skip}}(\\mathbf{x}_0 + \\sigma\\boldsymbol{\\epsilon}) \\;=\\; (1-c_{\\text{skip}})\\,\\mathbf{x}_0 \\,-\\, c_{\\text{skip}}\\,\\sigma\\,\\boldsymbol{\\epsilon},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">with variance <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><mo>+<\/mo><msubsup><mi>c<\/mi><mtext>skip<\/mtext><mn>2<\/mn><\/msubsup><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">(1-c_{\\text{skip}})^2 
\\sigma_{\\text{data}}^2 + c_{\\text{skip}}^2 \\sigma^2<\/annotation><\/semantics><\/math>. To minimise the <em>magnitude<\/em> of this target across <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>, equivalently, to give the network the least extreme regression problem, we choose <math><semantics><mrow><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{skip}}<\/annotation><\/semantics><\/math> to minimise this variance. Setting the derivative with respect to <math><semantics><mrow><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{skip}}<\/annotation><\/semantics><\/math> to zero:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mo>\u2212<\/mo><mn>2<\/mn><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo stretchy=\"false\">)<\/mo><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><mo>+<\/mo><mn>2<\/mn><mtext>\u2009<\/mtext><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mtext>\u2009<\/mtext><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo>=<\/mo><mn>0<\/mn><mtext>\u2005\u200a<\/mtext><mtext>\u2005\u200a<\/mtext><mo>\u27f9<\/mo><mtext>\u2005\u200a<\/mtext><mtext>\u2005\u200a<\/mtext><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><\/mrow><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">-2(1-c_{\\text{skip}})\\sigma_{\\text{data}}^2 + 2\\,c_{\\text{skip}}\\,\\sigma^2 = 0\n\\;\\;\\Longrightarrow\\;\\;\nc_{\\text{skip}}(\\sigma) \\;=\\; \\frac{\\sigma_{\\text{data}}^2}{\\sigma^2 + \\sigma_{\\text{data}}^2}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Substituting back, the residual variance is<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><mo>+<\/mo><msubsup><mi>c<\/mi><mtext>skip<\/mtext><mn>2<\/mn><\/msubsup><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mtext>\u2009<\/mtext><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><\/mrow><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><\/mrow><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">(1-c_{\\text{skip}})^2 \\sigma_{\\text{data}}^2 + c_{\\text{skip}}^2 \\sigma^2 \\;=\\; \\frac{\\sigma^2\\,\\sigma_{\\text{data}}^2}{\\sigma^2 + \\sigma_{\\text{data}}^2}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">For unit variance of the network target we need 
<math><semantics><mrow><msubsup><mi>c<\/mi><mtext>out<\/mtext><mn>2<\/mn><\/msubsup><mo>=<\/mo><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><mi mathvariant=\"normal\">\/<\/mi><mo stretchy=\"false\">(<\/mo><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{out}}^2 = \\sigma^2\\sigma_{\\text{data}}^2\/(\\sigma^2 + \\sigma_{\\text{data}}^2)<\/annotation><\/semantics><\/math>, hence<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>c<\/mi><mtext>out<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><mi>\u03c3<\/mi><mo>\u22c5<\/mo><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><\/mrow><msqrt><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{out}}(\\sigma) \\;=\\; \\frac{\\sigma\\cdot\\sigma_{\\text{data}}}{\\sqrt{\\sigma^2 + \\sigma_{\\text{data}}^2}}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Finally, <math><semantics><mrow><msub><mi>c<\/mi><mtext>noise<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{noise}}<\/annotation><\/semantics><\/math> is the input embedding of the noise level, for which <a href=\"#ref-karras2022\">Karras et al.<\/a> use the heuristic <math><semantics><mrow><msub><mi>c<\/mi><mtext>noise<\/mtext><\/msub><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mfrac><mn>1<\/mn><mn>4<\/mn><\/mfrac><mi>ln<\/mi><mo>\u2061<\/mo><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{noise}}(\\sigma) = \\tfrac{1}{4}\\ln(\\sigma)<\/annotation><\/semantics><\/math>.<\/p>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>Why this matters in practice: the two extremes.<\/em> The behaviour of these coefficients at small and large <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> explains why preconditioning is empirically transformative.<\/p>\n<ul>\n<li><strong>At small <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math><\/strong> (clean image, <math><semantics><mrow><mi>\u03c3<\/mi><mo>\u226a<\/mo><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma \\ll \\sigma_{\\text{data}}<\/annotation><\/semantics><\/math>): <math><semantics><mrow><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo>\u2248<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{skip}} \\approx 1<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mi>c<\/mi><mtext>out<\/mtext><\/msub><mo>\u2248<\/mo><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{out}} \\approx \\sigma<\/annotation><\/semantics><\/math>. 
The architecture essentially copies the input through the skip, and the network&#8217;s output is a small correction <em>scaled down by <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math><\/em>. If the network makes an error, that error is multiplied by a small number on its way out, <em>downscaled, not amplified<\/em>. The network is freed from having to learn the identity function (a real risk in naive setups) and can focus on producing a small denoising correction.<\/li>\n<li><strong>At large <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math><\/strong> (mostly noise, <math><semantics><mrow><mi>\u03c3<\/mi><mo>\u226b<\/mo><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma \\gg \\sigma_{\\text{data}}<\/annotation><\/semantics><\/math>): <math><semantics><mrow><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><mo>\u2248<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{skip}} \\approx 0<\/annotation><\/semantics><\/math> and <math><semantics><mrow><msub><mi>c<\/mi><mtext>out<\/mtext><\/msub><mo>\u2248<\/mo><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{out}} \\approx \\sigma_{\\text{data}}<\/annotation><\/semantics><\/math>. The skip is effectively disabled, and the network is asked to predict the clean signal directly. This avoids the disaster of a naive noise-prediction setup, where the network would predict <math><semantics><mrow><mo>\u2212<\/mo><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">-\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math> and have its output multiplied by <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>, meaning a network error of order <math><semantics><mrow><mi>e<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">e<\/annotation><\/semantics><\/math> becomes an output error of order <math><semantics><mrow><mi>\u03c3<\/mi><mi>e<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma e<\/annotation><\/semantics><\/math>, amplified by a potentially huge factor when <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> is in the dozens.<\/li>\n<\/ul>\n<p>The analytic <math><semantics><mrow><msub><mi>c<\/mi><mtext>skip<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">c_{\\text{skip}}<\/annotation><\/semantics><\/math> smoothly interpolates between these two regimes, automatically selecting &#8220;predict residual \/ predict signal&#8221; depending on how noisy the input is. The transition happens around <math><semantics><mrow><mi>\u03c3<\/mi><mo>\u223c<\/mo><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma \\sim \\sigma_{\\text{data}}<\/annotation><\/semantics><\/math>. 
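<\/p>\n<p>In code, the whole recipe is a few lines. This is a sketch following the formulas just derived, with <code>F<\/code> standing in for an arbitrary raw network:<\/p>\n<pre class=\"wp-block-code\"><code>import numpy as np\n\nsigma_data = 0.5                      # EDM default for normalised images\n\ndef c_in(s):    return 1.0 \/ np.sqrt(s**2 + sigma_data**2)\ndef c_skip(s):  return sigma_data**2 \/ (s**2 + sigma_data**2)\ndef c_out(s):   return s * sigma_data \/ np.sqrt(s**2 + sigma_data**2)\ndef c_noise(s): return 0.25 * np.log(s)\n\ndef D_theta(F, x, s):\n    # preconditioned denoiser wrapping the raw network F(input, noise_embedding)\n    return c_skip(s) * x + c_out(s) * F(c_in(s) * x, c_noise(s))\n\nfor s in (0.002, 0.5, 80.0):          # the two regimes and the crossover\n    print(s, c_skip(s), c_out(s))<\/code><\/pre>\n<p>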
The neural network always sees well-scaled inputs and produces well-scaled outputs; the brunt of the multi-scale problem is absorbed by an analytic skip connection.<\/p>\n<p>A further upside the EDM authors emphasise: because the preconditioning analytically handles the scale variation, the <em>architecture<\/em> of the inner network <math><semantics><mrow><msub><mi>F<\/mi><mi>\u03b8<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">F_\\theta<\/annotation><\/semantics><\/math> is now essentially free to be whatever you want (DDPM-style U-Net, NCSN-style, transformer, etc.) and the preconditioning works for &#8220;any&#8221; architecture. EDM cleanly separates &#8220;what the network is&#8221; from &#8220;how its inputs and outputs are scaled.&#8221;<\/p>\n<\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?ssl=1\"><img decoding=\"async\" width=\"2560\" height=\"1411\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?fit=625%2C344&amp;ssl=1\" alt=\"\" class=\"wp-image-14297\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?w=2560&amp;ssl=1 2560w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?resize=300%2C165&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?resize=1024%2C564&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?resize=768%2C423&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?resize=1536%2C847&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?resize=2048%2C1129&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?resize=624%2C344&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_skip_connection-scaled.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">The EDM sampler: Heun with stochastic churn<\/h3>\n\n\n\n<p class=\"\">For sampling, EDM proposes a deterministic <em>and<\/em> a stochastic variant of the same core procedure. 
The deterministic version is <strong>Heun&#8217;s second-order ODE solver<\/strong> applied to the EDM ODE: for each step from <math><semantics><mrow><msub><mi>t<\/mi><mi>i<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">t_i<\/annotation><\/semantics><\/math> to <math><semantics><mrow><msub><mi>t<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">t_{i+1}<\/annotation><\/semantics><\/math>,<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi mathvariant=\"bold\">d<\/mi><mi>i<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>i<\/mi><\/msub><mo>\u2212<\/mo><msub><mi>D<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>i<\/mi><\/msub><mo separator=\"true\">;<\/mo><msub><mi>t<\/mi><mi>i<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><msub><mi>t<\/mi><mi>i<\/mi><\/msub><\/mfrac><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>~<\/mo><\/mover><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>i<\/mi><\/msub><mo>+<\/mo><mo stretchy=\"false\">(<\/mo><msub><mi>t<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msub><mi>t<\/mi><mi>i<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">d<\/mi><mi>i<\/mi><\/msub><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{d}_i \\;=\\; \\frac{\\mathbf{x}_i &#8211; D_\\theta(\\mathbf{x}_i; t_i)}{t_i},\\qquad \\tilde{\\mathbf{x}}_{i+1} \\;=\\; \\mathbf{x}_i + (t_{i+1}-t_i)\\,\\mathbf{d}_i,<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msubsup><mi mathvariant=\"bold\">d<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><mo lspace=\"0em\" mathvariant=\"normal\" rspace=\"0em\">\u2032<\/mo><\/msubsup><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>~<\/mo><\/mover><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msub><mi>D<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>~<\/mo><\/mover><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo separator=\"true\">;<\/mo><msub><mi>t<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><msub><mi>t<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mfrac><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mi>i<\/mi><\/msub><mo>+<\/mo><mo stretchy=\"false\">(<\/mo><msub><mi>t<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msub><mi>t<\/mi><mi>i<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo>\u22c5<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mn>1<\/mn><mn>2<\/mn><\/mfrac><\/mstyle><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><msub><mi 
mathvariant=\"bold\">d<\/mi><mi>i<\/mi><\/msub><mo>+<\/mo><msubsup><mi mathvariant=\"bold\">d<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><mo lspace=\"0em\" mathvariant=\"normal\" rspace=\"0em\">\u2032<\/mo><\/msubsup><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{d}_{i+1}&#8217; \\;=\\; \\frac{\\tilde{\\mathbf{x}}_{i+1} &#8211; D_\\theta(\\tilde{\\mathbf{x}}_{i+1};t_{i+1})}{t_{i+1}},\\qquad \\mathbf{x}_{i+1} \\;=\\; \\mathbf{x}_i + (t_{i+1}-t_i)\\cdot\\tfrac{1}{2}\\bigl(\\mathbf{d}_i + \\mathbf{d}_{i+1}&#8217;\\bigr).<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">Heun is a <em>predictor\u2013corrector<\/em> scheme in the numerical-analysis sense: take a tentative Euler step, evaluate the slope at the predicted endpoint, then re-step using the average of the two slopes. The second-order trapezoidal correction halves the truncation error per step compared to plain Euler (or first-order DDIM) at the cost of one extra denoiser evaluation per step. <a href=\"#ref-karras2022\">Karras et al.<\/a> tested various higher-order solvers (RK45, linear multistep methods) and reported that Heun strikes the best quality-per-NFE tradeoff.<\/p>\n\n\n\n<p class=\"\">The stochastic variant adds Langevin-style <strong>churn<\/strong>: at the start of each step, inflate the current sample&#8217;s noise level by a small factor, then take a deterministic Heun step from the inflated level back down to <math><semantics><mrow><msub><mi>t<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">t_{i+1}<\/annotation><\/semantics><\/math>. Concretely:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>\u03b3<\/mi><mi>i<\/mi><\/msub><mo>=<\/mo><mi>min<\/mi><mo>\u2061<\/mo><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><msub><mi>S<\/mi><mtext>churn<\/mtext><\/msub><mi mathvariant=\"normal\">\/<\/mi><mi>N<\/mi><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msqrt><mn>2<\/mn><\/msqrt><mo>\u2212<\/mo><mn>1<\/mn><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mtext>\u00a0if\u00a0<\/mtext><msub><mi>S<\/mi><mtext>min<\/mtext><\/msub><mo>\u2264<\/mo><msub><mi>t<\/mi><mi>i<\/mi><\/msub><mo>\u2264<\/mo><msub><mi>S<\/mi><mtext>max<\/mtext><\/msub><mo separator=\"true\">,<\/mo><mtext>\u00a0else\u00a0<\/mtext><mn>0<\/mn><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\gamma_i = \\min\\!\\bigl(S_{\\text{churn}}\/N,\\,\\sqrt{2}-1\\bigr) \\text{ if } S_{\\text{min}}\\le t_i\\le S_{\\text{max}}, \\text{ else } 0,<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mover accent=\"true\"><mi>t<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><\/msub><mo>=<\/mo><msub><mi>t<\/mi><mi>i<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>+<\/mo><msub><mi>\u03b3<\/mi><mi>i<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><\/msub><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>i<\/mi><\/msub><mo>+<\/mo><msqrt><mrow><msubsup><mover 
accent=\"true\"><mi>t<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><mn>2<\/mn><\/msubsup><mo>\u2212<\/mo><msubsup><mi>t<\/mi><mi>i<\/mi><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><mo>\u22c5<\/mo><msub><mi>S<\/mi><mtext>noise<\/mtext><\/msub><mo>\u22c5<\/mo><mi mathvariant=\"bold-italic\">\u03f5<\/mi><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{t}_i = t_i(1+\\gamma_i),\\qquad \\hat{\\mathbf{x}}_i = \\mathbf{x}_i + \\sqrt{\\hat{t}_i^2 &#8211; t_i^2}\\cdot S_{\\text{noise}}\\cdot\\boldsymbol{\\epsilon},<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">followed by a Heun step from <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mover accent=\"true\"><mi>t<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\hat{\\mathbf{x}}_i, \\hat{t}_i)<\/annotation><\/semantics><\/math> to <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo separator=\"true\">,<\/mo><msub><mi>t<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\mathbf{x}_{i+1}, t_{i+1})<\/annotation><\/semantics><\/math>. The hyperparameters <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mi>S<\/mi><mtext>churn<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>S<\/mi><mtext>min<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>S<\/mi><mtext>max<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>S<\/mi><mtext>noise<\/mtext><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(S_{\\text{churn}}, S_{\\text{min}}, S_{\\text{max}}, S_{\\text{noise}})<\/annotation><\/semantics><\/math> control how much, where in the schedule, and with what amplification noise is reinjected. The interpretation is exactly the predictor\u2013corrector logic from Section 3 in a different disguise: the noise injection is the &#8220;Langevin exploration&#8221; that washes out deterministic drift error; the Heun step is the deterministic transport that follows the ODE flow lines. 
Putting them together at each step is empirically more controllable than running a full reverse-SDE solver.<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1076\" style=\"aspect-ratio: 1920 \/ 1076;\" width=\"1920\" controls src=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Clip_Heun_step.mp4\"><\/video><\/figure>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1076\" style=\"aspect-ratio: 1920 \/ 1076;\" width=\"1920\" controls src=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/Clip_EDM_sampler.mp4\"><\/video><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">The Karras <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>-schedule: spend more steps near the data<\/h3>\n\n\n\n<p class=\"\">The noise levels themselves are not linear or geometric but follow a <strong>power law<\/strong> parameterised by <math><semantics><mrow><mi>\u03c1<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\rho<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><msub><mi>\u03c3<\/mi><mi>i<\/mi><\/msub><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">(<\/mo><msubsup><mi>\u03c3<\/mi><mi>max<\/mi><mo>\u2061<\/mo><mrow><mn>1<\/mn><mi mathvariant=\"normal\">\/<\/mi><mi>\u03c1<\/mi><\/mrow><\/msubsup><mo>+<\/mo><mstyle displaystyle=\"false\" scriptlevel=\"0\"><mfrac><mi>i<\/mi><mrow><mi>N<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/mfrac><\/mstyle><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><msubsup><mi>\u03c3<\/mi><mi>min<\/mi><mo>\u2061<\/mo><mrow><mn>1<\/mn><mi mathvariant=\"normal\">\/<\/mi><mi>\u03c1<\/mi><\/mrow><\/msubsup><mo>\u2212<\/mo><msubsup><mi>\u03c3<\/mi><mi>max<\/mi><mo>\u2061<\/mo><mrow><mn>1<\/mn><mi mathvariant=\"normal\">\/<\/mi><mi>\u03c1<\/mi><\/mrow><\/msubsup><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><msup><mo fence=\"true\" maxsize=\"1.8em\" minsize=\"1.8em\" stretchy=\"true\">)<\/mo><mi>\u03c1<\/mi><\/msup><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><mi>i<\/mi><mo>=<\/mo><mn>0<\/mn><mo separator=\"true\">,<\/mo><mo>\u2026<\/mo><mo separator=\"true\">,<\/mo><mi>N<\/mi><mo>\u2212<\/mo><mn>1.<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_i \\;=\\; \\Bigl(\\sigma_{\\max}^{1\/\\rho} + \\tfrac{i}{N-1}\\bigl(\\sigma_{\\min}^{1\/\\rho} &#8211; \\sigma_{\\max}^{1\/\\rho}\\bigr)\\Bigr)^{\\rho},\\qquad i = 0,\\dots,N-1.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The defaults <math><semantics><mrow><msub><mi>\u03c3<\/mi><mi>min<\/mi><mo>\u2061<\/mo><\/msub><mo>=<\/mo><mn>0.002<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\min} = 0.002<\/annotation><\/semantics><\/math>, <math><semantics><mrow><msub><mi>\u03c3<\/mi><mi>max<\/mi><mo>\u2061<\/mo><\/msub><mo>=<\/mo><mn>80<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\max} = 80<\/annotation><\/semantics><\/math>, <math><semantics><mrow><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><mo>=<\/mo><mn>0.5<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\text{data}} = 0.5<\/annotation><\/semantics><\/math>, and <math><semantics><mrow><mi>\u03c1<\/mi><mo>=<\/mo><mn>7<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\rho = 7<\/annotation><\/semantics><\/math> 
allocate disproportionately many steps to the low-<math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> end, where the sample is close to the denoised image. The motivation is twofold:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\"><strong>Close to the data manifold, the ODE trajectory has its highest curvature.<\/strong> Far from the data the dynamics are gentle and approximately linear (recall that the EDM tangent points to the denoiser output, which barely changes). Close to the data the trajectory &#8220;locks onto&#8221; a specific sample, and the denoiser output begins to commit to fine details; this is where the flow lines turn. Linear Euler steps are most error-prone in regions of high curvature, so we want fine resolution there.<\/li>\n\n\n\n<li class=\"\"><strong>Small errors near the data are perceptually catastrophic.<\/strong> An error of magnitude <math><semantics><mrow><mi>e<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">e<\/annotation><\/semantics><\/math> at small <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> produces a visible artifact in the final sample (low <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> means <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t<\/annotation><\/semantics><\/math> is almost the final image, and any deviation will appear in the output as-is). The same error at large <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> gets washed out by subsequent denoising. Putting more steps where they matter gives much better quality per NFE.<\/li>\n<\/ol>\n\n\n\n<p class=\"\">Increasing <math><semantics><mrow><mi>\u03c1<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\rho<\/annotation><\/semantics><\/math> makes the allocation more lopsided toward small <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>. 
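<\/p>\n\n\n\n<p class=\"\">Since the schedule is only a few lines of code, it is worth seeing it written out. The sketch below is mine rather than a snippet from the EDM codebase; the argument names simply mirror the notation above.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\ndef karras_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):\n    # Power-law interpolation between sigma_max and sigma_min in 1\/rho space.\n    ramp = np.linspace(0.0, 1.0, n_steps)          # i \/ (N - 1)\n    hi = sigma_max ** (1 \/ rho)\n    lo = sigma_min ** (1 \/ rho)\n    return (hi + ramp * (lo - hi)) ** rho\n\nsigmas = karras_sigmas(10)\n# sigmas runs from sigma_max (80.0) down to sigma_min (0.002); the gaps\n# between successive sigmas shrink as sigma falls, so most of the step\n# budget is spent close to the data.\n<\/code><\/pre>\n\n\n\n<p class=\"\">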
The default <math><semantics><mrow><mi>\u03c1<\/mi><mo>=<\/mo><mn>7<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\rho = 7<\/annotation><\/semantics><\/math> was found by ablation; &#8220;Karras sigmas&#8221; with this schedule are now the default in essentially every modern image-diffusion pipeline.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?ssl=1\"><img decoding=\"async\" width=\"2560\" height=\"1426\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?fit=625%2C349&amp;ssl=1\" alt=\"\" class=\"wp-image-14301\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?w=2560&amp;ssl=1 2560w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?resize=300%2C167&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?resize=1024%2C571&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?resize=768%2C428&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?resize=1536%2C856&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?resize=2048%2C1141&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?resize=624%2C348&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_polunomial_step_size-scaled.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Training: loss weighting and the noise level distribution<\/h3>\n\n\n\n<p class=\"\">EDM&#8217;s training prescription is short but important enough that it deserves explicit treatment. 
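<\/p>\n\n\n\n<p class=\"\">Before the two training choices, it helps to have the preconditioned denoiser in front of us in code form. The following is a minimal restatement of the wrapper from the previous section, using the coefficients from the EDM paper; the raw network <code>net<\/code> and its call signature are placeholders of mine.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\n\nsigma_data = 0.5\n\ndef precondition(net, x, sigma):\n    # D_theta(x; sigma) = c_skip * x + c_out * F_theta(c_in * x, c_noise),\n    # with sigma a tensor broadcastable against x.\n    c_skip = sigma_data**2 \/ (sigma**2 + sigma_data**2)\n    c_out = sigma * sigma_data \/ (sigma**2 + sigma_data**2) ** 0.5\n    c_in = 1 \/ (sigma**2 + sigma_data**2) ** 0.5\n    c_noise = 0.25 * sigma.log()\n    return c_skip * x + c_out * net(c_in * x, c_noise)\n<\/code><\/pre>\n\n\n\n<p class=\"\">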
Two choices complement the architectural preconditioning.<\/p>\n\n\n\n<p class=\"\">The <strong>noise level distribution<\/strong> during training is log-normal:<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>ln<\/mi><mo>\u2061<\/mo><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>\u223c<\/mo><mtext>\u2005\u200a<\/mtext><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi>P<\/mi><mtext>mean<\/mtext><\/msub><mo separator=\"true\">,<\/mo><mtext>\u2009<\/mtext><msubsup><mi>P<\/mi><mtext>std<\/mtext><mn>2<\/mn><\/msubsup><mo stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><mspace width=\"2em\"><\/mspace><msub><mi>P<\/mi><mtext>mean<\/mtext><\/msub><mo>=<\/mo><mo>\u2212<\/mo><mn>1.2<\/mn><mo separator=\"true\">,<\/mo><mtext>\u2005\u200a<\/mtext><msub><mi>P<\/mi><mtext>std<\/mtext><\/msub><mo>=<\/mo><mn>1.2.<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\ln(\\sigma) \\;\\sim\\; \\mathcal{N}(P_{\\text{mean}},\\,P_{\\text{std}}^2),\\qquad P_{\\text{mean}} = -1.2,\\;P_{\\text{std}} = 1.2.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">The motivation here is the mirror image of the sampling-schedule argument above. Empirically, the per-<math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> training loss has a roughly <em>U-shape<\/em>: at very low <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> (the input is almost clean) the denoising task is trivial (any reasonable network produces a near-identity mapping) and there is little gradient signal to learn from. At very high <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> (the input is mostly noise) the denoising task is essentially hopeless (no signal to recover) and gradients are again uninformative. The sweet spot, where the network can actually make progress and gradients carry real information, sits in the <em>middle<\/em> of the <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> range. The log-normal distribution above concentrates training samples in exactly this regime, putting computational effort where the model can actually learn something.<\/p>\n\n\n\n<p class=\"\">Note that the sampling and training schedules answer different questions. Sampling allocates more steps near small <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> because that is where <em>truncation error<\/em> is most damaging. Training allocates more samples near intermediate <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math> because that is where <em>gradient information<\/em> is richest. 
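<\/p>\n\n\n\n<p class=\"\">To make the prescription concrete, here is a minimal sketch of one EDM-style training step. It assumes a <code>denoiser(x, sigma)<\/code> callable like the wrapper above, draws <code>ln(sigma)<\/code> from the log-normal just described (note that it never touches the Karras sampling schedule), and applies the loss weighting <code>lambda(sigma)<\/code> defined in the next paragraph.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\n\nP_mean, P_std, sigma_data = -1.2, 1.2, 0.5\n\ndef edm_training_loss(denoiser, x0):\n    # One noise level per sample: ln(sigma) ~ N(P_mean, P_std^2).\n    sigma = (P_mean + P_std * torch.randn(x0.shape&#091;0])).exp()\n    sigma = sigma.view(-1, *(&#091;1] * (x0.ndim - 1)))   # broadcast over data dims\n    x_noisy = x0 + sigma * torch.randn_like(x0)\n    denoised = denoiser(x_noisy, sigma)\n    # lambda(sigma) = (sigma^2 + sigma_data^2) \/ (sigma * sigma_data)^2\n    weight = (sigma**2 + sigma_data**2) \/ (sigma * sigma_data)**2\n    return (weight * (denoised - x0)**2).mean()\n<\/code><\/pre>\n\n\n\n<p class=\"\">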
The two need not coincide, and in EDM they don&#8217;t.<\/p>\n\n\n\n<p class=\"\">The <strong>loss weighting<\/strong> that complements this is<\/p>\n\n\n\n<div class=\"wp-math-display\"><math display=\"block\"><semantics><mrow><mi>\u03bb<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><mtext>\u2005\u200a<\/mtext><mo>=<\/mo><mtext>\u2005\u200a<\/mtext><mfrac><mrow><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mtext>data<\/mtext><mn>2<\/mn><\/msubsup><\/mrow><mrow><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo>\u22c5<\/mo><msub><mi>\u03c3<\/mi><mtext>data<\/mtext><\/msub><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><\/mrow><\/mfrac><mi mathvariant=\"normal\">.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\lambda(\\sigma) \\;=\\; \\frac{\\sigma^2 + \\sigma_{\\text{data}}^2}{(\\sigma\\cdot\\sigma_{\\text{data}})^2}.<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"\">This is chosen to equalise the <em>gradient magnitude<\/em> across noise levels. Without proper weighting, a naive <math><semantics><mrow><mo stretchy=\"false\">\u2225<\/mo><msub><mi>F<\/mi><mi>\u03b8<\/mi><\/msub><mo>\u2212<\/mo><mtext>target<\/mtext><msup><mo stretchy=\"false\">\u2225<\/mo><mn>2<\/mn><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">\\lVert F_\\theta &#8211; \\text{target}\\rVert^2<\/annotation><\/semantics><\/math> loss has gradients that vary by orders of magnitude across <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>, leading to highly imbalanced training: most updates push the network gently while occasional high-magnitude updates dominate. The weighting <math><semantics><mrow><mi>\u03bb<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03c3<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\lambda(\\sigma)<\/annotation><\/semantics><\/math>, combined with the unit-variance preconditioning, produces gradient updates of comparable magnitude at every <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>, which empirically gives much more stable training dynamics.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?ssl=1\"><img decoding=\"async\" width=\"2560\" height=\"1478\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?fit=625%2C361&amp;ssl=1\" alt=\"\" class=\"wp-image-14304\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?w=2560&amp;ssl=1 2560w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?resize=300%2C173&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?resize=1024%2C591&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?resize=768%2C443&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?resize=1536%2C887&amp;ssl=1 1536w, 
https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?resize=2048%2C1182&amp;ssl=1 2048w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?resize=624%2C360&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?w=1250&amp;ssl=1 1250w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2026\/05\/EDM_noise-distirbution-scaled.png?w=1875&amp;ssl=1 1875w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<div class=\"bg-blue-50\/50 dark:bg-blue-900\/20 p-4 rounded-lg border border-blue-100 dark:border-blue-800 my-4\">\n<p><em>An important observation on stochasticity, paraphrased from the EDM authors&#8217; talk.<\/em> The stochastic-churn sampler is empirically helpful, but it is a <em>double-edged sword<\/em>. The Langevin churn introduces its own discretisation error on top of the deterministic Heun step, so the net benefit depends on a delicate balance, and the optimal churn hyperparameters <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mi>S<\/mi><mtext>churn<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>S<\/mi><mtext>min<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>S<\/mi><mtext>max<\/mtext><\/msub><mo separator=\"true\">,<\/mo><msub><mi>S<\/mi><mtext>noise<\/mtext><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(S_{\\text{churn}}, S_{\\text{min}}, S_{\\text{max}}, S_{\\text{noise}})<\/annotation><\/semantics><\/math> need to be tuned per dataset and per architecture, with no clean principles for how to do this. Worse, churn can <em>mask bugs and modelling errors<\/em>: a model with a slight bias in its denoiser can still produce reasonable samples under enough exploration, even when its deterministic samples reveal that something is wrong.<\/p>\n<p>The authors&#8217; recommendation, which I have come to follow in my own protein-design work, is: develop and debug in fully deterministic mode (Heun on the PF-ODE), and add stochasticity as &#8220;the final cherry on top&#8221; only after you are confident the deterministic samples look right. As a calibration point, the EDM CIFAR-10 model benefits negligibly from churn at its trained quality (the network is good enough that any exploration injects more error than it fixes), while their ImageNet-64 model still benefits, a useful empirical pointer that the marginal value of churn diminishes as the score network gets better.<\/p>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\">EDM in the wild: the RFDiffusion3 inference sampler<\/h3>\n\n\n\n<p class=\"\">To make the relevance of all this concrete, let us look at how the <em>state-of-the-art protein-structure generator<\/em> RFDiffusion3 implements its sampler. The relevant code lives in <a href=\"https:\/\/github.com\/RosettaCommons\/foundry\/blob\/production\/models\/rfd3\/src\/rfd3\/model\/inference_sampler.py\"><code>inference_sampler.py<\/code><\/a> in the RosettaCommons <code>foundry<\/code> repository, in a method tellingly named <code>sample_diffusion_like_af3<\/code>. 
Stripping out the protein-specific bookkeeping (motif handling, classifier-free guidance, recycling, logging), the inner loop is essentially this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>for step_num, (c_t_minus_1, c_t) in enumerate(zip(noise_schedule, noise_schedule&#091;1:])):\n\n    # (1) Stochastic churn: inflate the current noise level by (1 + gamma).\n    gamma = self.gamma_0 if c_t &gt; self.gamma_min else 0\n    t_hat = c_t_minus_1 * (gamma + 1)\n\n    # (2) Inject noise to match the inflated level: epsilon ~ sqrt(t_hat^2 - t^2) * z.\n    epsilon_L = (\n        self.noise_scale\n        * torch.sqrt(torch.square(t_hat) - torch.square(c_t_minus_1))\n        * torch.normal(mean=0.0, std=1.0, size=X_L.shape, device=X_L.device)\n    )\n    X_noisy_L = X_L + epsilon_L\n\n    # (3) Call the preconditioned denoiser D_theta(x; t_hat) on the inflated state.\n    X_denoised_L = diffusion_module(X_noisy_L=X_noisy_L, t=t_hat.tile(D), ...)\n\n    # (4) EDM tangent direction: (x - D(x; t_hat)) \/ t_hat.\n    delta_L = (X_noisy_L - X_denoised_L) \/ t_hat   # \"gradient of x wrt. t at x_t_hat\"\n    d_t = c_t - t_hat\n\n    # (5) First-order Euler step along the EDM ODE.\n    X_L = X_noisy_L + step_scale * d_t * delta_L\n<\/code><\/pre>\n\n\n\n<p class=\"\">Each of these five lines is a <em>line-for-line transcription<\/em> of an equation we have just derived.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\"><strong>Lines (1) and (2)<\/strong> are EDM&#8217;s stochastic churn: <math><semantics><mrow><msub><mover accent=\"true\"><mi>t<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><\/msub><mo>=<\/mo><msub><mi>t<\/mi><mi>i<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>+<\/mo><msub><mi>\u03b3<\/mi><mi>i<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{t}_i = t_i(1+\\gamma_i)<\/annotation><\/semantics><\/math> followed by <math><semantics><mrow><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><\/msub><mo>=<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mi>i<\/mi><\/msub><mo>+<\/mo><msqrt><mrow><msubsup><mover accent=\"true\"><mi>t<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><mn>2<\/mn><\/msubsup><mo>\u2212<\/mo><msubsup><mi>t<\/mi><mi>i<\/mi><mn>2<\/mn><\/msubsup><\/mrow><\/msqrt><mo>\u22c5<\/mo><msub><mi>S<\/mi><mtext>noise<\/mtext><\/msub><mo>\u22c5<\/mo><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\hat{\\mathbf{x}}_i = \\mathbf{x}_i + \\sqrt{\\hat{t}_i^2 &#8211; t_i^2}\\cdot S_{\\text{noise}}\\cdot\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>. 
The thresholding <math><semantics><mrow><mi>\u03b3<\/mi><mo>=<\/mo><msub><mi>\u03b3<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\gamma = \\gamma_0<\/annotation><\/semantics><\/math> if <math><semantics><mrow><msub><mi>c<\/mi><mi>t<\/mi><\/msub><mo>&gt;<\/mo><msub><mi>\u03b3<\/mi><mtext>min<\/mtext><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">c_t &gt; \\gamma_\\text{min}<\/annotation><\/semantics><\/math> else <math><semantics><mrow><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">0<\/annotation><\/semantics><\/math> is exactly the <code>S_min<\/code>\/<code>S_max<\/code> window from the EDM paper, deciding <em>where<\/em> in the schedule to spend the churn budget.<\/li>\n\n\n\n<li class=\"\"><strong>Line (3)<\/strong> is the call to the preconditioned denoiser <math><semantics><mrow><msub><mi>D<\/mi><mi>\u03b8<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mover accent=\"true\"><mi>t<\/mi><mo>^<\/mo><\/mover><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">D_\\theta(\\mathbf{x}; \\hat{t})<\/annotation><\/semantics><\/math>. The <code>diffusion_module<\/code> here wraps the inner network with the analytic skip-connection and input\/output scalings we derived above.<\/li>\n\n\n\n<li class=\"\"><strong>Line (4)<\/strong> is <em>literally<\/em> the EDM ODE: <math><semantics><mrow><mi mathvariant=\"normal\">d<\/mi><mi mathvariant=\"bold\">x<\/mi><mi mathvariant=\"normal\">\/<\/mi><mi mathvariant=\"normal\">d<\/mi><mi>t<\/mi><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo separator=\"true\">;<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{d}\\mathbf{x}\/\\mathrm{d}t = (\\mathbf{x} &#8211; D(\\mathbf{x}; t))\/t<\/annotation><\/semantics><\/math>. The variable is even named <code>delta_L<\/code> and the comment in the source code reads &#8220;gradient of x wrt. t at x_t_hat&#8221;.<\/li>\n\n\n\n<li class=\"\"><strong>Line (5)<\/strong> is the Euler step <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>=<\/mo><msub><mover accent=\"true\"><mi mathvariant=\"bold\">x<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><\/msub><mo>+<\/mo><mo stretchy=\"false\">(<\/mo><msub><mi>t<\/mi><mrow><mi>i<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>t<\/mi><mo>^<\/mo><\/mover><mi>i<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">d<\/mi><mi>i<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_{i+1} = \\hat{\\mathbf{x}}_i + (t_{i+1} &#8211; \\hat{t}_i)\\,\\mathbf{d}_i<\/annotation><\/semantics><\/math>, with an extra <code>step_scale<\/code> knob that lets the user dial the step magnitude up or down (useful for trading sample diversity against fidelity in protein design).<\/li>\n<\/ul>\n\n\n\n<p class=\"\">Two small caveats worth flagging. First, RFDiffusion3 uses a <em>first-order<\/em> Euler integrator rather than the second-order Heun method recommended in the EDM paper; in practice the protein community has found that the extra denoiser evaluation per step is not worth its cost when the network is already expensive. 
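<\/p>\n\n\n\n<p class=\"\">For reference, this is what upgrading that loop from first-order Euler to second-order Heun would look like. This is a sketch in the notation of this post, not code from the <code>foundry<\/code> repository.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def heun_step(denoiser, x, t_cur, t_next):\n    # Predictor: plain Euler step along dx\/dt = (x - D(x; t)) \/ t.\n    d_cur = (x - denoiser(x, t_cur)) \/ t_cur\n    x_pred = x + (t_next - t_cur) * d_cur\n    if t_next == 0:\n        return x_pred                  # no slope defined at t = 0\n    # Corrector: slope at the predicted endpoint, then trapezoidal average.\n    # This is the one extra denoiser evaluation discussed above.\n    d_next = (x_pred - denoiser(x_pred, t_next)) \/ t_next\n    return x + (t_next - t_cur) * 0.5 * (d_cur + d_next)\n<\/code><\/pre>\n\n\n\n<p class=\"\">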
Second, the <code>step_scale<\/code> and <code>noise_scale<\/code> multipliers are protein-design-specific knobs grafted on top of the pure EDM recipe to control downstream metrics like RMSD-to-motif and sample diversity, the kind of practical adjustment one always finds when transplanting a recipe from images to structures.<\/p>\n\n\n\n<p class=\"\">The point is that essentially every line of the inner sampling loop of a modern de-novo protein-design model is <em>exactly<\/em> the <a href=\"#ref-karras2022\">Karras et al.<\/a> EDM sampler from Section 4, with the same churn, the same <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"bold\">x<\/mi><mo>\u2212<\/mo><mi>D<\/mi><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\/<\/mi><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">(\\mathbf{x} &#8211; D)\/t<\/annotation><\/semantics><\/math> tangent direction, and the same preconditioned denoiser at its heart. Everything we have built up in this section is not retrospective taxonomy: it is the code path running every time someone designs a new protein with RFDiffusion3.<\/p>\n\n\n\n<div class=\"h-16\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">5. The Variance Schedule, Up Close<\/h2>\n\n\n\n<p class=\"\">Because every method we have discussed is, mathematically, a particular Gaussian probability path <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><mtext>\u2009<\/mtext><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo>+<\/mo><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mtext>\u2009<\/mtext><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_t = \\alpha_t\\,\\mathbf{x}_0 + \\sigma_t\\,\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>, it is worth pausing to lay out side by side the choices of <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\alpha_t, \\sigma_t)<\/annotation><\/semantics><\/math> that constitute the <em>variance schedule<\/em> in each framework.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Framework<\/th><th><math><semantics><mrow><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\alpha_t<\/annotation><\/semantics><\/math><\/th><th><math><semantics><mrow><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_t<\/annotation><\/semantics><\/math><\/th><th><math><semantics><mrow><msubsup><mi>\u03b1<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><\/mrow><annotation encoding=\"application\/x-tex\">\\alpha_t^2 + \\sigma_t^2<\/annotation><\/semantics><\/math> (when <math><semantics><mrow><mrow><mi mathvariant=\"normal\">V<\/mi><mi mathvariant=\"normal\">a<\/mi><mi mathvariant=\"normal\">r<\/mi><\/mrow><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\mathrm{Var}(\\mathbf{x}_0) = 1<\/annotation><\/semantics><\/math>)<\/th><th>Boundary at <math><semantics><mrow><mi>t<\/mi><mo>\u2192<\/mo><mi>T<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t \\to 
T<\/annotation><\/semantics><\/math><\/th><\/tr><\/thead><tbody><tr><td><strong>DDPM \/ VP, linear<\/strong><\/td><td><math><semantics><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math>, <math><semantics><mrow><msub><mi>\u03b2<\/mi><mi>t<\/mi><\/msub><mo>\u2208<\/mo><mo stretchy=\"false\">[<\/mo><msup><mn>10<\/mn><mrow><mo>\u2212<\/mo><mn>4<\/mn><\/mrow><\/msup><mo separator=\"true\">,<\/mo><mn>2<\/mn><mtext>\u2009\u2063<\/mtext><mo>\u22c5<\/mo><mtext>\u2009\u2063<\/mtext><msup><mn>10<\/mn><mrow><mo>\u2212<\/mo><mn>2<\/mn><\/mrow><\/msup><mo stretchy=\"false\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\beta_t\\in[10^{-4},2\\!\\cdot\\!10^{-2}]<\/annotation><\/semantics><\/math> linear<\/td><td><math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math><\/td><td><math><semantics><mrow><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">= 1<\/annotation><\/semantics><\/math> (preserved)<\/td><td><math><semantics><mrow><mo>\u2192<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\to\\mathcal{N}(\\mathbf{0},\\mathbf{I})<\/annotation><\/semantics><\/math><\/td><\/tr><tr><td><strong>DDPM \/ VP, cosine<\/strong><\/td><td><math><semantics><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mo>=<\/mo><mi>cos<\/mi><mo>\u2061<\/mo><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mfrac><mrow><mi>t<\/mi><mi mathvariant=\"normal\">\/<\/mi><mi>T<\/mi><mo>+<\/mo><mi>s<\/mi><\/mrow><mrow><mn>1<\/mn><mo>+<\/mo><mi>s<\/mi><\/mrow><\/mfrac><mo>\u22c5<\/mo><mfrac><mi>\u03c0<\/mi><mn>2<\/mn><\/mfrac><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><mtext>\u2009\u2063<\/mtext><mi mathvariant=\"normal\">\/<\/mi><mtext>\u2009\u2063<\/mtext><mi>cos<\/mi><mo>\u2061<\/mo><mtext>\u2009\u2063<\/mtext><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">(<\/mo><mfrac><mi>s<\/mi><mrow><mn>1<\/mn><mo>+<\/mo><mi>s<\/mi><\/mrow><\/mfrac><mo>\u22c5<\/mo><mfrac><mi>\u03c0<\/mi><mn>2<\/mn><\/mfrac><mo fence=\"true\" maxsize=\"1.2em\" minsize=\"1.2em\" stretchy=\"true\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\bar{\\alpha}_t} = \\cos\\!\\bigl(\\tfrac{t\/T+s}{1+s}\\cdot\\tfrac{\\pi}{2}\\bigr)\\!\/\\!\\cos\\!\\bigl(\\tfrac{s}{1+s}\\cdot\\tfrac{\\pi}{2}\\bigr)<\/annotation><\/semantics><\/math><\/td><td><math><semantics><mrow><msqrt><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{1-\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math><\/td><td><math><semantics><mrow><mo>=<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">= 1<\/annotation><\/semantics><\/math> (preserved)<\/td><td><math><semantics><mrow><mo>\u2192<\/mo><mi mathvariant=\"script\">N<\/mi><mo 
stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\to\\mathcal{N}(\\mathbf{0},\\mathbf{I})<\/annotation><\/semantics><\/math><\/td><\/tr><tr><td><strong>NCSN \/ VE<\/strong><\/td><td><math><semantics><mrow><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">1<\/annotation><\/semantics><\/math><\/td><td><math><semantics><mrow><mi>\u03c3<\/mi><mo stretchy=\"false\">(<\/mo><mi>t<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma(t)<\/annotation><\/semantics><\/math>, geometric or polynomial<\/td><td><math><semantics><mrow><mn>1<\/mn><mo>+<\/mo><msubsup><mi>\u03c3<\/mi><mi>t<\/mi><mn>2<\/mn><\/msubsup><\/mrow><annotation encoding=\"application\/x-tex\">1 + \\sigma_t^2<\/annotation><\/semantics><\/math> (exploding)<\/td><td><math><semantics><mrow><mo>\u2192<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><msubsup><mi>\u03c3<\/mi><mi>max<\/mi><mo>\u2061<\/mo><mn>2<\/mn><\/msubsup><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\to\\mathcal{N}(\\mathbf{x}_0,\\sigma_{\\max}^2\\mathbf{I})<\/annotation><\/semantics><\/math><\/td><\/tr><tr><td><strong>sub-VP<\/strong><\/td><td><math><semantics><mrow><msqrt><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><\/mrow><annotation encoding=\"application\/x-tex\">\\sqrt{\\bar{\\alpha}_t}<\/annotation><\/semantics><\/math><\/td><td><math><semantics><mrow><mn>1<\/mn><mo>\u2212<\/mo><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">1-\\bar{\\alpha}_t<\/annotation><\/semantics><\/math> (note: no square root)<\/td><td><math><semantics><mrow><mo>&lt;<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">&lt; 1<\/annotation><\/semantics><\/math> at intermediate <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math><\/td><td><math><semantics><mrow><mo>\u2192<\/mo><mi mathvariant=\"script\">N<\/mi><mo stretchy=\"false\">(<\/mo><mn mathvariant=\"bold\">0<\/mn><mo separator=\"true\">,<\/mo><mi mathvariant=\"bold\">I<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\to\\mathcal{N}(\\mathbf{0},\\mathbf{I})<\/annotation><\/semantics><\/math><\/td><\/tr><tr><td><strong>EDM<\/strong><\/td><td><math><semantics><mrow><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">1<\/annotation><\/semantics><\/math><\/td><td><math><semantics><mrow><mi>\u03c3<\/mi><mo>\u2208<\/mo><mo stretchy=\"false\">[<\/mo><mn>0.002<\/mn><mo separator=\"true\">,<\/mo><mn>80<\/mn><mo stretchy=\"false\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma\\in[0.002,80]<\/annotation><\/semantics><\/math> via <math><semantics><mrow><mi>\u03c1<\/mi><mo>=<\/mo><mn>7<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\rho=7<\/annotation><\/semantics><\/math> power law<\/td><td><math><semantics><mrow><mn>1<\/mn><mo>+<\/mo><msup><mi>\u03c3<\/mi><mn>2<\/mn><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">1 + \\sigma^2<\/annotation><\/semantics><\/math> (VE-like)<\/td><td>truncated at 
<math><semantics><mrow><msub><mi>\u03c3<\/mi><mi>max<\/mi><mo>\u2061<\/mo><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma_{\\max}<\/annotation><\/semantics><\/math>, not <math><semantics><mrow><mi mathvariant=\"normal\">\u221e<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\infty<\/annotation><\/semantics><\/math><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"\">A few comments worth internalising. <strong>Linear vs cosine<\/strong> is a small but consequential change: <a href=\"#ref-nichol2021\">Nichol &amp; Dhariwal (2021)<\/a> noted that the linear schedule destroys the signal far too aggressively for low-resolution images, so that the last stretch of the forward process is already almost pure noise and contributes little training signal; the cosine schedule keeps <math><semantics><mrow><msub><mover accent=\"true\"><mi>\u03b1<\/mi><mo>\u02c9<\/mo><\/mover><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\bar{\\alpha}_t<\/annotation><\/semantics><\/math> near 1 for longer and then collapses smoothly. <strong>VP vs VE<\/strong> is a more fundamental geometric choice: under VP the marginal variance stays pinned at one, under VE it grows without bound. EDM is essentially VE plus principled preconditioning, which restores numerical stability and lets the network see well-scaled inputs at every <math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>. Finally, <strong>sub-VP<\/strong> sacrifices a small amount of sample quality for a sharper concentration around <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math> at intermediate <math><semantics><mrow><mi>t<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>, improving likelihood.<\/p>\n\n\n\n<div class=\"h-16\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">6. Connections, Tradeoffs, and Where the Field Sits<\/h2>\n\n\n\n<p class=\"\">It is now possible to draw the picture cleanly. Every method we have discussed corresponds to a choice of three things:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\">A <strong>probability path<\/strong> <math><semantics><mrow><mo stretchy=\"false\">{<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">}<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\{p_t\\}<\/annotation><\/semantics><\/math> interpolating noise and data, defined by <math><semantics><mrow><mo stretchy=\"false\">(<\/mo><msub><mi>\u03b1<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi>\u03c3<\/mi><mi>t<\/mi><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\alpha_t, \\sigma_t)<\/annotation><\/semantics><\/math>. 
VP, VE, sub-VP, and EDM-VE are different paths.<\/li>\n\n\n\n<li class=\"\">A <strong>regression target<\/strong> (<math><semantics><mrow><mi mathvariant=\"bold-italic\">\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\boldsymbol{\\epsilon}<\/annotation><\/semantics><\/math>, <math><semantics><mrow><msub><mi mathvariant=\"bold\">x<\/mi><mn>0<\/mn><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbf{x}_0<\/annotation><\/semantics><\/math>, or score <math><semantics><mrow><mi mathvariant=\"normal\">\u2207<\/mi><mi>log<\/mi><mo>\u2061<\/mo><msub><mi>p<\/mi><mi>t<\/mi><\/msub><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla\\log p_t<\/annotation><\/semantics><\/math>), all of which are related by exact closed-form linear transformations once the path and prior are fixed.<\/li>\n\n\n\n<li class=\"\">A <strong>sampler<\/strong>, either an SDE solver (DDPM ancestral, reverse-SDE Euler\u2013Maruyama, EDM-stochastic), a deterministic ODE solver (DDIM, Heun, RK4), or a hybrid (predictor\u2013corrector, EDM-churn).<\/li>\n<\/ol>\n\n\n\n<p class=\"\">The historical narrative is essentially that successive papers separated these three choices and showed that mixing them productively gives better samples. DDPM coupled VP, <math><semantics><mrow><mi>\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\epsilon<\/annotation><\/semantics><\/math>-prediction, and ancestral SDE sampling. DDIM kept VP and <math><semantics><mrow><mi>\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\epsilon<\/annotation><\/semantics><\/math> but swapped the sampler for an ODE. <a href=\"#ref-song2021b\">Song et al. (2021b)<\/a> showed all of these are discretisations of a continuous SDE and that the same model admits both stochastic and deterministic samplers, and introduced VE and sub-VP. EDM then showed that with the right preconditioning and ODE solver, the difference between VP and VE essentially vanishes and most empirical gains come from sampler quality and noise-level distribution at training.<\/p>\n\n\n\n<p class=\"\">The <strong>stochastic vs deterministic<\/strong> tradeoff is worth spelling out. Stochastic samplers (DDPM, reverse-SDE, EDM with churn) act as their own error correctors: noise injection at intermediate steps washes out accumulated drift error from the score network, often producing <em>better<\/em> FID at high NFE than the corresponding deterministic ODE solver on the same model. Deterministic samplers (DDIM, Heun on the PF-ODE) are faster per quality at low NFE, are invertible (useful for editing, latent interpolation, likelihood evaluation), and are reproducible. <a href=\"#ref-karras2022\">Karras et al. 
(2022)<\/a> systematically characterise this crossover; in my own protein-design work I tend to use the deterministic Heun sampler for NFE budgets under ~30 and switch to stochastic-churn variants when I have NFE to spare or am worried about mode coverage.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison of sampling methods<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th>Method<\/th><th>Year<\/th><th>Key idea<\/th><th>Sampling type<\/th><th>Pros<\/th><th>When to use<\/th><\/tr><\/thead><tbody><tr><td><strong>DDPM<\/strong> (<a href=\"#ref-ho2020\">Ho et al.<\/a>)<\/td><td>2020<\/td><td>Reverse Markov chain, <math><semantics><mrow><mi>\u03f5<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\epsilon<\/annotation><\/semantics><\/math>-prediction, linear <math><semantics><mrow><mi>\u03b2<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\beta<\/annotation><\/semantics><\/math><\/td><td>Stochastic (ancestral)<\/td><td>Simple, foundational, robust training<\/td><td>Reference implementations; teaching<\/td><\/tr><tr><td><strong>DDIM<\/strong> (<a href=\"#ref-song2021a\">Song et al.<\/a>)<\/td><td>2021<\/td><td>Non-Markovian forward, <math><semantics><mrow><mi>\u03b7<\/mi><mo>=<\/mo><mn>0<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\eta=0<\/annotation><\/semantics><\/math> deterministic<\/td><td>Deterministic (or interpolated)<\/td><td>10 to 50 times fewer steps than DDPM, invertible, latent interpolation<\/td><td>Fast inference with an existing DDPM-trained model<\/td><\/tr><tr><td><strong>NCSN \/ VE-SDE<\/strong> (<a href=\"#ref-songermon2019\">Song &amp; Ermon<\/a>; <a href=\"#ref-song2021b\">Song et al.<\/a>)<\/td><td>2019 \/ 2021<\/td><td>Score matching at multiple noise scales; Langevin sampling<\/td><td>Stochastic (Langevin \/ SDE)<\/td><td>Strong likelihoods; PF-ODE for exact density<\/td><td>High-noise regimes; likelihood-critical applications<\/td><\/tr><tr><td><strong>VP-SDE \/ sub-VP<\/strong> (<a href=\"#ref-song2021b\">Song et al.<\/a>)<\/td><td>2021<\/td><td>Continuous-time DDPM via SDE; PC sampler; PF-ODE<\/td><td>Both<\/td><td>Unified framework; principled likelihood; PC samplers<\/td><td>When you want the SDE formalism (e.g. inverse problems, guidance)<\/td><\/tr><tr><td><strong>EDM<\/strong> (<a href=\"#ref-karras2022\">Karras et al.<\/a>)<\/td><td>2022<\/td><td><math><semantics><mrow><mi>\u03c3<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">\\sigma<\/annotation><\/semantics><\/math>-parameterisation, preconditioning, Heun + churn<\/td><td>Both<\/td><td>Best FID per NFE; clean code; modular<\/td><td>Modern default for image \/ structure diffusion; my default in protein work<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"\">The current state of the art in protein-structure generation is, at the time of writing in May 2026, a mixture of these ingredients. Backbone-generation models like RFDiffusion descend directly from the score-SDE \/ EDM lineage. The point I would press on a student starting the field now is that the <em>training<\/em> objectives across all of these are nearly identical regression losses on linear reparameterisations of the same quantity (noise, clean data, or score), and what really differs between methods is the path, the sampler, and the symmetry group baked into the network. 
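<\/p>\n\n\n\n<p class=\"\">Because this point is easy to state and just as easy to forget, it is worth writing the dictionary out once. Under the Gaussian path <code>x_t = alpha_t * x0 + sigma_t * eps<\/code>, the three regression targets convert into one another in closed form; the sketch below assumes the network predicts <code>eps<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def eps_to_x0(eps_hat, x_t, alpha_t, sigma_t):\n    # Invert x_t = alpha_t * x0 + sigma_t * eps for the clean sample.\n    return (x_t - sigma_t * eps_hat) \/ alpha_t\n\ndef eps_to_score(eps_hat, sigma_t):\n    # Gaussian-path identity: grad log p_t(x_t) = -eps \/ sigma_t.\n    return -eps_hat \/ sigma_t\n\ndef x0_to_eps(x0_hat, x_t, alpha_t, sigma_t):\n    # The same linear identity read in the other direction.\n    return (x_t - alpha_t * x0_hat) \/ sigma_t\n<\/code><\/pre>\n\n\n\n<p class=\"\">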
Pick the path that gives the geometry you want; pick the sampler that matches your NFE budget; train the same kind of MSE in every case.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Where to look next<\/h3>\n\n\n\n<p class=\"\">The story above stops short of an important and very active branch of the field: <strong>flow matching<\/strong> and its relatives (rectified flow, OT-CFM). These methods drop the diffusion construction entirely and learn the velocity field of an ODE that transports noise to data along arbitrary, often straight-line, probability paths. They have become dominant in protein-structure generation since roughly 2023 and deserve their own dedicated treatment. For a broader and excellent overview of this family I recommend the <a href=\"https:\/\/diffusion.csail.mit.edu\/2026\/docs\/lecture_notes.pdf\">MIT 6.S979 lecture notes on diffusion and flow-based generative models<\/a>, which cover flow matching, conditional flow matching, and the connections back to the score-based picture we have built up here.<\/p>\n\n\n\n<div class=\"h-16\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"\"><span id=\"ref-anderson1982\"><\/span>Anderson, B. D. O. (1982). <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/0304414982900515\"><em>Reverse-time diffusion equation models<\/em><\/a>. Stochastic Processes and Their Applications 12(3), 313\u2013326.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-feller1949\"><\/span>Feller, W. (1949). <a href=\"https:\/\/digitalassets.lib.berkeley.edu\/math\/ucb\/text\/math_s1_article-27.pdf\"><em>On the theory of stochastic processes, with particular reference to applications<\/em><\/a>. Proceedings of the (First) Berkeley Symposium on Mathematical Statistics and Probability, 403\u2013432.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-ho2020\"><\/span>Ho, J., Jain, A., &amp; Abbeel, P. (2020). <a href=\"https:\/\/arxiv.org\/abs\/2006.11239\"><em>Denoising Diffusion Probabilistic Models<\/em><\/a>. NeurIPS 2020. arXiv:2006.11239.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-hyvarinen2005\"><\/span>Hyv\u00e4rinen, A. (2005). <a href=\"https:\/\/jmlr.org\/papers\/v6\/hyvarinen05a.html\"><em>Estimation of non-normalized statistical models by score matching<\/em><\/a>. Journal of Machine Learning Research 6, 695\u2013709.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-karras2022\"><\/span>Karras, T., Aittala, M., Aila, T., &amp; Laine, S. (2022). <a href=\"https:\/\/arxiv.org\/abs\/2206.00364\"><em>Elucidating the Design Space of Diffusion-Based Generative Models<\/em><\/a>. NeurIPS 2022. arXiv:2206.00364.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-mit2026\"><\/span>MIT 6.S979 (2026). <a href=\"https:\/\/diffusion.csail.mit.edu\/2026\/docs\/lecture_notes.pdf\"><em>Lecture notes on diffusion and flow-based generative models<\/em><\/a>.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-nichol2021\"><\/span>Nichol, A., &amp; Dhariwal, P. (2021). <a href=\"https:\/\/arxiv.org\/abs\/2102.09672\"><em>Improved Denoising Diffusion Probabilistic Models<\/em><\/a>. ICML 2021. arXiv:2102.09672.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-sohldickstein2015\"><\/span>Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., &amp; Ganguli, S. (2015). <a href=\"https:\/\/arxiv.org\/abs\/1503.03585\"><em>Deep Unsupervised Learning using Nonequilibrium Thermodynamics<\/em><\/a>. ICML 2015. arXiv:1503.03585.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-song2021a\"><\/span>Song, J., Meng, C., &amp; Ermon, S. (2021a). 
<a href=\"https:\/\/arxiv.org\/abs\/2010.02502\"><em>Denoising Diffusion Implicit Models<\/em><\/a>. ICLR 2021. arXiv:2010.02502.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-songermon2019\"><\/span>Song, Y., &amp; Ermon, S. (2019). <a href=\"https:\/\/arxiv.org\/abs\/1907.05600\"><em>Generative Modeling by Estimating Gradients of the Data Distribution<\/em><\/a>. NeurIPS 2019. arXiv:1907.05600.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-song2021b\"><\/span>Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., &amp; Poole, B. (2021b). <a href=\"https:\/\/arxiv.org\/abs\/2011.13456\"><em>Score-Based Generative Modeling through Stochastic Differential Equations<\/em><\/a>. ICLR 2021. arXiv:2011.13456.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-vincent2011\"><\/span>Vincent, P. (2011). <a href=\"https:\/\/www.iro.umontreal.ca\/~vincentp\/Publications\/smdae_techreport.pdf\"><em>A Connection Between Score Matching and Denoising Autoencoders<\/em><\/a>. Neural Computation 23(7), 1661\u20131674.<\/li>\n\n\n\n<li class=\"\"><span id=\"ref-watson2023\"><\/span>Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., et al. (2023). <a href=\"https:\/\/www.nature.com\/articles\/s41586-023-06415-8\"><em>De novo design of protein structure and function with RFdiffusion<\/em><\/a>. Nature 620, 1089\u20131100.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>When approaching the methods used in de-novo protein design, one is quickly confronted with a plethora of overlapping formulations of what looks superficially like &#8220;the same thing&#8221;. One paper trains an \u03f5\\boldsymbol{\\epsilon}-prediction network with a simple MSE loss; another trains a score network with a stochastic-differential-equation justification; a third trains a clean-data predictor under yet [&hellip;]<\/p>\n","protected":false},"author":146,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"ppma_author":[933],"class_list":["post-14282","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":933,"user_id":146,"is_guest":0,"slug":"lorenzo","display_name":"Lorenzo 
Tarricone","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/bb743b2ea1b5c047d05b2cfd55633a1fe9be32aa960b5370f1b1d9536c996bb7?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14282","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/146"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=14282"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14282\/revisions"}],"predecessor-version":[{"id":14307,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/14282\/revisions\/14307"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=14282"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=14282"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=14282"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=14282"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}