Flow Matching Models#

Probability Paths and Learned Transport#

Note

From a teaching viewpoint, flow matching is valuable because it isolates the idea of learning a transport field without forcing students to begin from a stochastic reverse process.

Once students understand diffusion models, there is a natural next question: is stochastic denoising the only way to transport a simple distribution into a complex image distribution? Flow matching answers this question in the negative [13].

The main conceptual shift is the following. Diffusion models generate by reversing a noising process. Flow matching models instead learn a deterministic time-dependent vector field that transports samples from a simple base distribution to the data distribution.

This is important pedagogically because it shows that modern generative modeling is not tied to a single probabilistic mechanism. One can learn density structure either through stochastic score dynamics or through deterministic transport.

Probability paths.

Let \(\boldsymbol{x}_0 \sim p_0\) be sampled from a simple base law, usually Gaussian, and let \(\boldsymbol{x}_1 \sim p_{\mathrm{data}}\) be a target image. The first step is to define an interpolation path between them:

\[ \boldsymbol{x}_t = \phi_t(\boldsymbol{x}_0,\boldsymbol{x}_1), \qquad t \in [0,1]. \]

A particularly simple choice is linear interpolation:

\[ \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t \boldsymbol{x}_1. \]

This path should be interpreted as a family of intermediate random variables that gradually moves from noise to data.

import torch

x0 = torch.tensor([-1.0, 0.5])
x1 = torch.tensor([2.0, 1.5])
for t in torch.linspace(0.0, 1.0, 6):
    x_t = (1.0 - t) * x0 + t * x1
    print(f't={float(t):.1f} -> {x_t.tolist()}')
t=0.0 -> [-1.0, 0.5]
t=0.2 -> [-0.4000000059604645, 0.7000000476837158]
t=0.4 -> [0.19999998807907104, 0.9000000357627869]
t=0.6 -> [0.8000000715255737, 1.100000023841858]
t=0.8 -> [1.4000000953674316, 1.3000000715255737]
t=1.0 -> [2.0, 1.5]

Velocity fields.

If the interpolation path is differentiable in time, it has an associated velocity

\[ \boldsymbol{u}_t(\boldsymbol{x}_0,\boldsymbol{x}_1)=\frac{d}{dt}\phi_t(\boldsymbol{x}_0,\boldsymbol{x}_1). \]

For the linear path one gets

\[ \boldsymbol{u}_t(\boldsymbol{x}_0,\boldsymbol{x}_1)=\boldsymbol{x}_1-\boldsymbol{x}_0. \]

The learning problem is then: can we train a neural network to predict the correct velocity at each intermediate point of the path?
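Before turning to learning, it is worth verifying numerically that the linear path really does have the constant velocity \(\boldsymbol{x}_1-\boldsymbol{x}_0\). The following sketch compares a central finite difference of \(\phi_t\) against that constant:

```python
import torch

x0 = torch.tensor([-1.0, 0.5])
x1 = torch.tensor([2.0, 1.5])

def phi(t):
    # Linear interpolation path between x0 and x1.
    return (1.0 - t) * x0 + t * x1

# Central finite difference approximating d/dt phi_t at a few times.
eps = 1e-3
for t in [0.25, 0.5, 0.75]:
    numeric = (phi(t + eps) - phi(t - eps)) / (2.0 * eps)
    print(f't={t}: finite difference = {numeric.tolist()}, '
          f'x1 - x0 = {(x1 - x0).tolist()}')
```

The finite difference agrees with \(\boldsymbol{x}_1-\boldsymbol{x}_0\) at every time, which is exactly the constancy that makes the linear path convenient as a training target.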

The flow matching objective.

One introduces a neural vector field

\[ \boldsymbol{v}_{\boldsymbol{\Theta}}(\boldsymbol{x},t) \]

and trains it by minimizing

\[ \mathcal{L}_{\mathrm{FM}}(\boldsymbol{\Theta}) = \mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_1,t} \left[ \|\boldsymbol{v}_{\boldsymbol{\Theta}}(\boldsymbol{x}_t,t)-\boldsymbol{u}_t(\boldsymbol{x}_0,\boldsymbol{x}_1)\|_2^2 \right]. \]

The meaning of this loss is direct: at every time \(t\), the model is asked to reproduce the velocity that moves the sample along the chosen path from noise toward data.
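REPLACED_BY_CLARIFY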

Once training is complete, the learned vector field defines the ODE

\[ \frac{d}{dt}\boldsymbol{x}_t = \boldsymbol{v}_{\boldsymbol{\Theta}}(\boldsymbol{x}_t,t), \qquad \boldsymbol{x}_0 \sim p_0. \]

Solving this ODE forward in time transports samples from the base distribution to (an approximation of) the data distribution.
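To make the sampling step concrete before introducing a trained network, the sketch below integrates the ODE with the simplest solver, Euler's method, using a constant translation field as an illustrative stand-in for \(\boldsymbol{v}_{\boldsymbol{\Theta}}\) (this is exactly the velocity of a linear path that shifts every sample by a fixed vector):

```python
import torch

torch.manual_seed(0)

# Illustrative stand-in for a trained network: the constant field
# v(x, t) = c, the exact velocity of a linear path shifting all samples by c.
c = torch.tensor([2.0, -1.0])

def velocity(x, t):
    return c.expand_as(x)

# Euler integration of dx/dt = v(x, t) over t in [0, 1].
x = torch.randn(1000, 2)      # samples from the base distribution p_0
num_steps = 10
dt = 1.0 / num_steps
for k in range(num_steps):
    x = x + dt * velocity(x, k * dt)

# The base mean (close to 0) has been transported by c.
print('Empirical mean after transport:', x.mean(dim=0).tolist())
```

The empirical mean of the transported samples lands near \(c\), which is the whole content of "transporting the base distribution" in this trivial case.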

# Tiny learned vector field for linear transport in 2D.
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(3, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(300):
    x0 = torch.randn(128, 2)
    x1 = x0 + torch.tensor([2.0, -1.0])
    t = torch.rand(128, 1)
    xt = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    pred_velocity = model(torch.cat([xt, t], dim=1))
    loss = torch.mean((pred_velocity - target_velocity) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step in [0, 49, 149, 299]:
        print(f'Step {step + 1:03d} | velocity loss = {loss.item():.6f}')

with torch.no_grad():
    test_xt = torch.tensor([[0.0, 0.0]])
    test_t = torch.tensor([[0.5]])
    pred = model(torch.cat([test_xt, test_t], dim=1))
    print('Predicted velocity near the middle of the path:', pred)
Step 001 | velocity loss = 3.626394
Step 050 | velocity loss = 0.021693
Step 150 | velocity loss = 0.002953
Step 300 | velocity loss = 0.000763
Predicted velocity near the middle of the path: tensor([[ 2.0481, -1.0079]])

This example is intentionally simple: the target transport is just a translation in the plane. Precisely because the target is simple, students can focus on the role of the vector field itself rather than on architectural complications.

import torch

x0 = torch.tensor([0.0, 0.0])
target = torch.tensor([3.0, -2.0])
state = x0.clone().float()
step_size = 0.2

for step in range(5):
    velocity = target - state
    state = state + step_size * velocity
    print(f'Step {step + 1}: state = {state.tolist()}')

print('This is the Euler discretization of a simple transport field toward the target point.')
Step 1: state = [0.6000000238418579, -0.4000000059604645]
Step 2: state = [1.0800000429153442, -0.7200000286102295]
Step 3: state = [1.4639999866485596, -0.9760000705718994]
Step 4: state = [1.7711999416351318, -1.1808000802993774]
Step 5: state = [2.0169599056243896, -1.3446400165557861]
This is the Euler discretization of a simple transport field toward the target point.

Comparison With Diffusion and Relevance for Imaging#

The comparison with diffusion is very instructive.

Diffusion models learn how to denoise a sample corrupted by a stochastic process. Flow matching models learn how to move a sample deterministically through a velocity field.

Diffusion emphasizes score estimation and reverse-time stochastic dynamics.

Flow matching emphasizes transport and ordinary differential equations.

The two viewpoints are related, but they give different algorithmic and conceptual advantages.

Why flow matching is attractive.

One major attraction is speed. Because the learned dynamics can be integrated with relatively few ODE solver steps, sampling can be significantly faster than in many diffusion pipelines.
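The speed claim can be illustrated in miniature (this is a toy comparison, not a benchmark): when the transport trajectories are straight, as for the constant field above, a single Euler step is already exact, whereas curved dynamics, here the hypothetical field \(v(x,t)=-x\), need many steps to keep the discretization error small:

```python
import torch

def euler(v, x, num_steps):
    # Euler integration of dx/dt = v(x, t) over t in [0, 1].
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * v(x, k * dt)
    return x

x0 = torch.tensor([1.0, -2.0])

# Straight trajectories: constant velocity, one Euler step is exact.
c = torch.tensor([3.0, -2.0])
straight = lambda x, t: c
print('straight,   1 step :', euler(straight, x0, 1).tolist())
print('straight, 100 steps:', euler(straight, x0, 100).tolist())

# Curved trajectories: dx/dt = -x, exact endpoint is exp(-1) * x0.
curved = lambda x, t: -x
exact = torch.exp(torch.tensor(-1.0)) * x0
for n in [2, 10, 100]:
    err = (euler(curved, x0, n) - exact).norm().item()
    print(f'curved, {n:3d} steps: error = {err:.4f}')
```

The error for the curved field shrinks only as the step count grows, while the straight field is solved exactly in one step; flow matching's preference for near-straight paths is what makes few-step sampling plausible.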

Another attraction is flexibility. The model does not need to be tied to a particular noising schedule in the same way as diffusion. Instead, one chooses a probability path and learns the associated transport field.

This means that flow matching is often presented as a promising route toward faster high-quality generative modeling.

The importance of the chosen path.

At this stage, students should be warned that the path \(\phi_t\) is not arbitrary in practice. Different choices of interpolation yield different target vector fields and therefore different learning difficulties.

A poor path may force the model to learn unnecessarily complicated transport dynamics. A good path can make the flow smoother and easier to approximate.
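The effect of the path choice can be seen already in two dimensions. The sketch below compares the linear path with one illustrative alternative, a trigonometric interpolation \(\boldsymbol{x}_t=\cos(\pi t/2)\,\boldsymbol{x}_0+\sin(\pi t/2)\,\boldsymbol{x}_1\) (chosen here only for demonstration): the linear path has a constant target velocity, while the trigonometric path forces the network to represent a genuinely time-varying field:

```python
import math
import torch

x0 = torch.tensor([-1.0, 0.5])
x1 = torch.tensor([2.0, 1.5])

for t in [0.1, 0.5, 0.9]:
    # Linear path: velocity is the same at every time.
    v_linear = x1 - x0
    # Trigonometric path x_t = cos(pi t / 2) x0 + sin(pi t / 2) x1:
    # its velocity depends explicitly on t.
    v_trig = (-math.pi / 2) * math.sin(math.pi * t / 2) * x0 \
           + (math.pi / 2) * math.cos(math.pi * t / 2) * x1
    print(f't={t}: linear velocity {v_linear.tolist()}, '
          f'trig-path velocity {v_trig.tolist()}')
```

Both paths connect the same endpoints, but the regression targets they induce differ sharply, which is the sense in which the path choice shapes the learning difficulty.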

This is another instance of a recurring course principle: the design of the training objective already contains substantial modeling assumptions.

Conditional Formulations for Inverse Problems#

To adapt the model to inverse problems, one conditions the vector field on the measured datum:

\[ \boldsymbol{v}_{\boldsymbol{\Theta}}(\boldsymbol{x},t,\boldsymbol{y}^\delta). \]

During training, one uses pairs \((\boldsymbol{x}^\dagger,\boldsymbol{y}^\delta)\) satisfying

\[ \boldsymbol{y}^\delta = K \boldsymbol{x}^\dagger + \boldsymbol{e}. \]

The conditional objective becomes

\[ \mathcal{L}_{\mathrm{CFM}}(\boldsymbol{\Theta}) = \mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}^\dagger,\boldsymbol{y}^\delta,t} \left[ \|\boldsymbol{v}_{\boldsymbol{\Theta}}(\boldsymbol{x}_t,t,\boldsymbol{y}^\delta)-\boldsymbol{u}_t(\boldsymbol{x}_0,\boldsymbol{x}^\dagger)\|_2^2 \right]. \]

At inference time, for a fixed measurement \(\boldsymbol{y}^\delta\), one integrates the learned conditional ODE from a sample of the base distribution. The result is a reconstruction distributed according to the learned conditional image law.

# Tiny conditional transport example where the condition shifts the target.
import torch

torch.manual_seed(0)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(300):
    x0 = torch.randn(128, 2)
    y = torch.randn(128, 1)
    shift = torch.cat([y, -0.5 * y], dim=1)
    x1 = x0 + shift
    t = torch.rand(128, 1)
    xt = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    pred_velocity = model(torch.cat([xt, t, y], dim=1))
    loss = torch.mean((pred_velocity - target_velocity) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step in [0, 49, 149, 299]:
        print(f'Step {step + 1:03d} | conditional velocity loss = {loss.item():.6f}')
Step 001 | conditional velocity loss = 0.588843
Step 050 | conditional velocity loss = 0.001639
Step 150 | conditional velocity loss = 0.000491
Step 300 | conditional velocity loss = 0.000177

The measurement variable here is only a scalar placeholder, but the logic is the same as in a real conditional inverse problem: the transport field depends not only on the current state and time, but also on the observed datum that defines the target posterior.
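The conditional example above trains the field but never samples from it. The following self-contained sketch (same toy setup, with the condition shifting the target by \((y,\,-0.5y)\)) closes the loop: it retrains the small conditional network and then integrates the learned conditional ODE with Euler steps for one fixed measurement:

```python
import torch

torch.manual_seed(0)

# Same toy setup as above: the condition y shifts the target by (y, -0.5 y).
model = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(500):
    x0 = torch.randn(256, 2)
    y = torch.randn(256, 1)
    x1 = x0 + torch.cat([y, -0.5 * y], dim=1)
    t = torch.rand(256, 1)
    xt = (1.0 - t) * x0 + t * x1
    loss = torch.mean((model(torch.cat([xt, t, y], dim=1)) - (x1 - x0)) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Inference: fix a measurement and integrate the conditional ODE.
with torch.no_grad():
    y_fixed = torch.full((512, 1), 1.0)
    x = torch.randn(512, 2)          # samples from the base law
    num_steps = 20
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((512, 1), k * dt)
        x = x + dt * model(torch.cat([x, t, y_fixed], dim=1))
    # For y = 1 the training shift is (1.0, -0.5), so the empirical mean
    # of the transported samples should be close to that shift.
    print('Empirical mean of reconstructions:', x.mean(dim=0).tolist())
```

Integrating from many base samples for the same fixed \(\boldsymbol{y}^\delta\) is precisely the posterior-like sampling described above: each base sample yields one plausible reconstruction.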

Why this is interesting in imaging.

This conditional viewpoint is powerful for at least three reasons.

First, it allows uncertainty-aware reconstruction. One may sample multiple plausible outputs for the same measurement and thereby explore posterior variability.

Second, it can be computationally attractive because deterministic transport may require fewer steps than iterative reverse diffusion.

Third, it integrates naturally with the conditioning information coming from the acquisition process, making it a promising framework for modern learned inverse solvers.

Summary#

The conceptual roadmap of this chapter is:

  • flow matching learns deterministic transport rather than reverse denoising;

  • the model is trained to predict the velocity field along a chosen probability path;

  • ODE integration transforms a simple base law into the data distribution;

  • conditional flow matching adapts this idea to inverse problems by conditioning on the measurements;

  • the main appeal of the method is the possibility of faster posterior-like sampling compared with diffusion approaches.

Exercises#

  1. Explain the difference between learning a score field and learning a velocity field.

  2. Why does the choice of interpolation path matter in flow matching?

  3. In a conditional inverse-problem setting, what role does the measurement play in the vector field \(\boldsymbol{v}_{\boldsymbol{\Theta}}(\boldsymbol{x},t,\boldsymbol{y}^\delta)\)?

  4. Challenge exercise: compare flow matching with diffusion as posterior models for inverse problems. Which tradeoff would you expect in terms of speed, flexibility, and modeling difficulty?

Further Reading#

Flow matching is easier to understand after diffusion rather than before it, because it can be seen as an alternative answer to the same generative question: how do we transport a simple base law into a complex data law? Students interested in current research should keep an eye on how deterministic transport ideas are being adapted to conditional and posterior-sampling settings in imaging.