Diffusion Models#

Forward Perturbation and Reverse Generation#

VAEs and GANs are historically fundamental, but they do not exhaust the modern theory of generative modeling. In recent years, diffusion and score-based models have become central in imaging because they combine high sample quality with a relatively stable training procedure. They are also particularly interesting from a mathematical point of view, because they connect denoising, stochastic processes, and Bayesian inference.

For students, diffusion models often look mysterious at first. The best way to introduce them is not to start from the final algorithm, but from the basic guiding idea: if generating complex images directly is difficult, perhaps it is easier to learn how to reverse a gradual noising process.

Forward diffusion.

Let \(\boldsymbol{x}_0 \sim p_{\mathrm{data}}\) be a sample from the image distribution. The forward diffusion process progressively corrupts this sample by adding Gaussian noise over many steps:

\[ q(\boldsymbol{x}_t| \boldsymbol{x}_{t-1}) = \mathcal{N}\big(\sqrt{1-\beta_t}\,\boldsymbol{x}_{t-1},\beta_t I\big), \]

where the sequence \((\beta_t)\) is called the variance schedule.

The meaning of this construction is simple. For small \(t\), the sample still resembles the original image. For large \(t\), the signal is progressively destroyed and the distribution approaches a simple Gaussian law.

One of the most important facts is that this multi-step process admits the closed form

\[ \boldsymbol{x}_t = \sqrt{\bar{\alpha}_t}\,\boldsymbol{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon}\sim\mathcal{N}(0,I), \]

with

\[ \alpha_t=1-\beta_t, \qquad \bar{\alpha}_t=\prod_{s=1}^t \alpha_s. \]

This formula is extremely useful because it means we can generate a noisy sample at any time step directly from the clean image, without simulating the full chain step by step.
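This can be checked numerically: simulating the chain step by step should reproduce the statistics predicted by the closed form. A minimal sketch, using a small hypothetical variance schedule and scalar "images":

```python
import torch

torch.manual_seed(0)

# Hypothetical small variance schedule.
T = 50
betas = torch.linspace(1e-3, 0.2, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

# Simulate the chain step by step for many scalar samples starting at x0 = 1.
n = 200_000
x = torch.ones(n)
for t in range(T):
    x = torch.sqrt(alphas[t]) * x + torch.sqrt(betas[t]) * torch.randn(n)

# The closed form predicts mean sqrt(alpha_bar_T) and variance 1 - alpha_bar_T.
print('empirical mean:', x.mean().item(), 'closed form:', alpha_bar[-1].sqrt().item())
print('empirical var :', x.var().item(), 'closed form:', (1.0 - alpha_bar[-1]).item())
```

The empirical mean and variance after fifty simulated steps agree with the one-shot formula, which is why training can jump directly to any time step.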

import numpy as np
import torch
from pathlib import Path
from PIL import Image
from IPython.display import display

def course_asset_path(name):
    # Search the working directory and two parent directories for imgs/<name>.
    here = Path.cwd().resolve()
    for base in (here, here.parent, here.parent.parent):
        candidate = base / 'imgs' / name
        if candidate.exists():
            return candidate
    return None

img_path = course_asset_path('GoPro.jpg')
if img_path is not None:
    image = Image.open(img_path).convert('L').resize((160, 160))
    x0 = torch.tensor(np.array(image), dtype=torch.float32) / 255.0
else:
    # Fallback: a synthetic rectangle, so the cell also runs without the asset.
    x0 = torch.zeros(160, 160)
    x0[40:120, 54:106] = 1.0

# Each entry plays the role of 1 - alpha_bar_t in the closed-form noising formula.
schedule = [0.05, 0.20, 0.45, 0.70]
panels = []
for noise_level in schedule:
    noisy = torch.sqrt(torch.tensor(1.0 - noise_level)) * x0 + torch.sqrt(torch.tensor(noise_level)) * torch.randn_like(x0)
    noisy = noisy.clamp(0.0, 1.0)
    panels.append(Image.fromarray((255 * noisy.numpy()).astype(np.uint8)))

strip = Image.new('L', (160 * len(panels), 160))
for i, panel in enumerate(panels):
    strip.paste(panel, (160 * i, 0))

display(strip)
print('Noise levels from left to right:', schedule)
Noise levels from left to right: [0.05, 0.2, 0.45, 0.7]

The reverse problem.

Once the forward noising mechanism is fixed, the generative task becomes the reverse one: starting from Gaussian noise, recover a sample distributed like a clean image.

In principle, one would like to model the reverse transitions

\[ p_{\boldsymbol{\Theta}}(\boldsymbol{x}_{t-1}| \boldsymbol{x}_t). \]

If these reverse conditionals were known exactly, one could sample \(\boldsymbol{x}_T \sim \mathcal{N}(0,I)\) and then successively denoise until reaching a realistic image.

The remarkable insight of diffusion models is that one does not need to learn arbitrary reverse transitions directly. It is enough to learn the structure of denoising at each noise scale.

The noise prediction objective.

In the DDPM formulation [9], a neural network \(\boldsymbol{\varepsilon}_{\boldsymbol{\Theta}}(\boldsymbol{x}_t,t)\) is trained to predict the noise used to generate \(\boldsymbol{x}_t\). The standard loss is

\[ \mathcal{L}_{\mathrm{DDPM}}(\boldsymbol{\Theta}) = \mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{\varepsilon},t} \Big[ \|\boldsymbol{\varepsilon}-\boldsymbol{\varepsilon}_{\boldsymbol{\Theta}}(\boldsymbol{x}_t,t)\|_2^2 \Big]. \]

This objective looks almost deceptively simple. One samples a clean image, chooses a time step, corrupts the image with Gaussian noise, and asks the network to predict the noise component. Yet this simple denoising problem is enough to learn a powerful generative model.

Pedagogically, this is worth emphasizing. Diffusion training is effective because it decomposes a difficult global generation problem into many local denoising tasks of varying difficulty.

# Tiny denoiser training experiment at one noise level.
import numpy as np
import torch
from pathlib import Path
from PIL import Image

def course_asset_path(name):
    # Search the working directory and two parent directories for imgs/<name>.
    here = Path.cwd().resolve()
    for base in (here, here.parent, here.parent.parent):
        candidate = base / 'imgs' / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f'Could not locate imgs/{name} from {here}')

torch.manual_seed(0)

img = Image.open(course_asset_path('GoPro.jpg')).convert('L').resize((32, 32))
x = torch.tensor(np.array(img), dtype=torch.float32) / 255.0
patches = []
for i in range(0, 24, 8):
    for j in range(0, 24, 8):
        patches.append(x[i:i+16, j:j+16])
clean = torch.stack(patches).unsqueeze(1)

sigma = 0.25
noise = torch.randn_like(clean)
noisy = clean + sigma * noise

model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(150):
    pred_noise = model(noisy)
    loss = torch.mean((pred_noise - noise) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch in [0, 19, 79, 149]:
        print(f'Epoch {epoch + 1:03d} | noise-prediction loss = {loss.item():.6f}')

with torch.no_grad():
    denoised = noisy - sigma * model(noisy)
    noisy_rmse = torch.mean((noisy - clean) ** 2).sqrt().item()
    denoised_rmse = torch.mean((denoised - clean) ** 2).sqrt().item()
print(f'Noisy RMSE: {noisy_rmse:.6f}')
print(f'Denoised RMSE: {denoised_rmse:.6f}')
Epoch 001 | noise-prediction loss = 1.034992
Epoch 020 | noise-prediction loss = 0.270470
Epoch 080 | noise-prediction loss = 0.166430
Epoch 150 | noise-prediction loss = 0.129830
Noisy RMSE: 0.252191
Denoised RMSE: 0.089994

This is not yet a full diffusion model, because the noise level is fixed rather than randomized over many time steps. But it isolates the key learning problem behind DDPM training: predict the corruption so that one can move back toward the clean image.
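The step from this fixed-level experiment to full DDPM training is small: sample the time step at random and let the network see it. A minimal sketch on toy 2-D data, with a hypothetical time-conditioned MLP standing in for the UNet used in practice:

```python
import torch

torch.manual_seed(0)

# Hypothetical DDPM variance schedule.
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Toy time-conditioned denoiser: the normalized time step is simply
# concatenated to the input; real models use a UNet with time embeddings.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x0 = 0.5 * torch.randn(128, 2) + 1.0           # toy "clean data"
    t = torch.randint(0, T, (128,))                # random time step per sample
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(1)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps  # closed-form forward noising
    pred = model(torch.cat([xt, (t.float() / T).unsqueeze(1)], dim=1))
    loss = torch.mean((pred - eps) ** 2)           # DDPM noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print('final noise-prediction loss:', loss.item())
```

Every iteration draws a fresh triple (clean sample, time step, noise), so the single network is trained on denoising tasks at all noise scales simultaneously.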

Score-Based Interpretation and Sampling#

It is helpful to explain the intuition behind the loss. If the network can reliably infer which part of \(\boldsymbol{x}_t\) is signal and which part is noise, then it can estimate how to move back toward the clean data manifold. Repeating this operation across multiple scales gradually reconstructs an image from pure noise.

Thus, diffusion models can be introduced to students as a hierarchy of denoisers linked together by a probabilistic time evolution.

Score-based interpretation.

There is a second, deeper interpretation. Instead of predicting the noise directly, one may think of the model as learning the score

\[ \nabla_\boldsymbol{x} \log p_t(\boldsymbol{x}), \]

namely the gradient of the log density of the noisy image distribution at time \(t\).

In continuous-time score-based modeling [15], one trains a network

\[ s_{\boldsymbol{\Theta}}(\boldsymbol{x}_t,t)\approx \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t). \]

The score indicates the direction in which probability density increases. Therefore, if one knows the score field, one knows how to move noisy samples toward more likely clean images.

This viewpoint is conceptually powerful because it links diffusion models to classical statistical objects such as likelihood gradients.

Note

A score is a gradient of log density, not a reconstruction itself. This is one of the conceptual hurdles for students: the model is not learning the final image directly, but a direction field that points toward higher probability regions.
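The two parameterizations are equivalent up to scaling: with \(\boldsymbol{x}_t=\sqrt{\bar{\alpha}_t}\,\boldsymbol{x}_0+\sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\varepsilon}\), the score satisfies \(s_{\boldsymbol{\Theta}}(\boldsymbol{x}_t,t) = -\boldsymbol{\varepsilon}_{\boldsymbol{\Theta}}(\boldsymbol{x}_t,t)/\sqrt{1-\bar{\alpha}_t}\). A sketch checking this identity on a one-dimensional standard Gaussian, where both sides are available in closed form:

```python
import torch

# Example noise level; for x0 ~ N(0, 1) the marginal p_t is again N(0, 1),
# because alpha_bar + (1 - alpha_bar) = 1.
alpha_bar = torch.tensor(0.6)
x_t = torch.linspace(-2.0, 2.0, 5)

# The score of p_t = N(0, 1) is simply -x_t.
score = -x_t

# The optimal noise predictor is E[eps | x_t] = sqrt(1 - alpha_bar) * x_t,
# since eps and x_t are jointly Gaussian with Cov = sqrt(1 - alpha_bar).
eps_hat = torch.sqrt(1.0 - alpha_bar) * x_t

# Identity: score = -eps_hat / sqrt(1 - alpha_bar).
print(score)
print(-eps_hat / torch.sqrt(1.0 - alpha_bar))
```

The two printed tensors coincide, which is why the DDPM noise predictor and the score network of the continuous-time formulation carry the same information.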

Tweedie’s formula.

One of the most useful identities in this area is Tweedie’s formula. Under Gaussian perturbations, the posterior mean of the clean image given the noisy one satisfies

\[ \mathbb{E}[\boldsymbol{x}_0| \boldsymbol{x}_t] = \boldsymbol{x}_t+\sigma_t^2 \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t). \]

This formula tells us that once we know the score, we also know the conditional mean denoiser. It is exactly this bridge between denoising and score estimation that makes diffusion models so useful for inverse problems.

From a teaching perspective, this is a crucial milestone. It shows that diffusion models are not just image generators. They encode differential information about the image prior.

import torch

sigma = 0.7
x_t = torch.linspace(-3.0, 3.0, 7)
score = -x_t / (1.0 + sigma**2)
posterior_mean = x_t + sigma**2 * score
closed_form = x_t / (1.0 + sigma**2)

print('Posterior mean from Tweedie:', posterior_mean)
print('Closed-form posterior mean:', closed_form)
print('Maximum absolute difference:', float((posterior_mean - closed_form).abs().max()))
Posterior mean from Tweedie: tensor([-2.0134, -1.3423, -0.6711,  0.0000,  0.6711,  1.3423,  2.0134])
Closed-form posterior mean: tensor([-2.0134, -1.3423, -0.6711,  0.0000,  0.6711,  1.3423,  2.0134])
Maximum absolute difference: 2.384185791015625e-07

Reverse sampling.

Once the score or noise predictor has been learned, one can generate samples by approximately simulating the reverse dynamics. There are several possibilities.

Stochastic sampling.

One follows a reverse-time stochastic process whose drift depends on the learned score. This tends to preserve the probabilistic nature of the model.

Deterministic sampling.

Methods such as DDIM or probability-flow ODE samplers replace the stochastic reverse process with a deterministic trajectory. These methods often reduce the number of function evaluations needed to obtain a good sample.

This distinction is worth explaining because it prepares the transition to flow matching later in the course.
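To make the reverse dynamics concrete, here is a sketch of stochastic (ancestral) sampling on a one-dimensional Gaussian toy distribution, for which the exact score, and hence the optimal noise predictor, is known in closed form; the reverse variance is set to \(\beta_t\), a common simple choice:

```python
import torch

torch.manual_seed(0)

# Toy data distribution x0 ~ N(2, 0.5^2): every marginal p_t is Gaussian,
# so the exact score is available analytically.
mu, s2 = 2.0, 0.25
T = 200
betas = torch.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def eps_hat(x, t):
    # p_t = N(sqrt(ab)*mu, ab*s2 + 1 - ab); convert its score into the
    # optimal noise prediction via eps = -sqrt(1 - ab) * score.
    ab = alpha_bar[t]
    var_t = ab * s2 + (1.0 - ab)
    score = -(x - torch.sqrt(ab) * mu) / var_t
    return -torch.sqrt(1.0 - ab) * score

# Ancestral DDPM sampling, starting from pure noise.
n = 50_000
x = torch.randn(n)
for t in range(T - 1, -1, -1):
    mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps_hat(x, t)) / torch.sqrt(alphas[t])
    noise = torch.randn(n) if t > 0 else torch.zeros(n)
    x = mean + torch.sqrt(betas[t]) * noise

print('sample mean:', x.mean().item(), '(target 2.0)')
print('sample var :', x.var().item(), '(target 0.25)')
```

Starting from \(\mathcal{N}(0,1)\) and denoising backwards, the samples approximately recover the data mean and variance; with a learned noise predictor in place of `eps_hat`, this is exactly the DDPM sampling loop.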

Why Diffusion Models Matter for Imaging and Inverse Problems#

There are several reasons for their success.

First, the training loss is stable compared with adversarial methods. One does not need to solve a minimax game as in GANs.

Second, the model naturally learns features across many noise scales. This is particularly valuable in imaging, where structures may appear at very different intensities and resolutions.

Third, the denoiser architecture used in practice is often a UNet. This brings all the benefits of multiscale image processing into the generative model.

Fourth, diffusion models can be adapted naturally to conditional settings, which is exactly what inverse problems require.

The main cost of diffusion models.

The main drawback is computational cost. Sampling usually requires many denoising steps, and each step involves a large neural network evaluation. Compared with a single forward pass in a GAN or VAE, this can be much slower.

This is not a minor inconvenience. In scientific imaging, inference speed can matter greatly. It is one of the reasons the field has invested so much effort in faster samplers and alternative transport-based generative models.

Why diffusion models are so relevant for inverse problems.

For inverse problems, the central quantity of interest is often the posterior distribution

\[ p(\boldsymbol{x}| \boldsymbol{y}^\delta) \propto p(\boldsymbol{y}^\delta| \boldsymbol{x})p(\boldsymbol{x}). \]

Classical regularization methods usually provide only a point estimate. Diffusion models, by contrast, can approximate the prior part of this posterior through score information. Since the posterior score decomposes as

\[ \nabla_\boldsymbol{x} \log p(\boldsymbol{x}| \boldsymbol{y}^\delta) = \nabla_\boldsymbol{x} \log p(\boldsymbol{y}^\delta| \boldsymbol{x}) + \nabla_\boldsymbol{x} \log p(\boldsymbol{x}), \]

the diffusion model supplies the prior score term, while the forward model supplies the likelihood score term.

This is the moment in the teaching roadmap where generative modeling reconnects directly with inverse problems.

# Tiny posterior correction step combining a prior-like denoiser and a likelihood gradient.
import torch

torch.manual_seed(0)

A = torch.tensor([[1.0, 0.0], [0.0, 0.5]])
y = torch.tensor([0.8, -0.2])
x = torch.tensor([1.5, -1.0])
step = 0.2
sigma2 = 0.1

prior_score = -x  # Gaussian prior score for a toy example.
likelihood_score = -(A.T @ (A @ x - y)) / sigma2
posterior_update = x + step * (prior_score + likelihood_score)

print('Current state:', x)
print('Prior score:', prior_score)
print('Likelihood score:', likelihood_score)
print('One posterior-guided update:', posterior_update)
Current state: tensor([ 1.5000, -1.0000])
Prior score: tensor([-1.5000,  1.0000])
Likelihood score: tensor([-7.0000,  1.5000])
One posterior-guided update: tensor([-0.2000, -0.5000])

In a true diffusion posterior sampler, the prior score would come from a trained score network rather than from this simplified Gaussian expression. The point of the example is to display the structure of the update, not to imitate a full-scale algorithm.

Summary#

The main ideas that students should retain are:

  • diffusion models learn to reverse a progressive Gaussian corruption process;

  • the DDPM loss trains a hierarchy of denoisers across noise scales;

  • the score-based viewpoint connects the model with gradients of log densities;

  • Tweedie’s formula links score estimation to posterior mean denoising;

  • diffusion models are powerful in imaging because they provide a strong learned prior, but they are computationally expensive at sampling time.

Exercises#

  1. Derive the closed-form expression for \(\boldsymbol{x}_t\) in the forward diffusion process.

  2. Explain in words why predicting noise can be enough to learn a generative model.

  3. State the meaning of Tweedie’s formula in the context of denoising.

  4. Why are diffusion models particularly attractive for inverse problems despite their sampling cost?

Further Reading#

Diffusion models are best understood by connecting three viewpoints: denoising, stochastic dynamics, and score estimation. Students who want a deeper grasp should compare the discrete DDPM presentation with the continuous score-based SDE viewpoint and keep asking how the two descriptions encode the same underlying prior information.

A useful challenge is to trace exactly where the forward operator of an inverse problem enters once a diffusion model is used for posterior-guided reconstruction.