Self-Supervised Training for Inverse Problems#

The Self-Supervised Reconstruction Problem#

Having presented supervised end-to-end training, we now ask whether one can still train reconstruction networks when clean targets are unavailable. In computational imaging this is not a peripheral issue. It is one of the central practical difficulties of the field.

The reason is straightforward. The object we want to reconstruct is usually not directly observable. If we had access to perfect target images for every measured datum, the inverse problem would already be partially solved. In many applications, collecting high-quality paired data is costly, dangerous, or physically impossible.

At this point the course makes a conceptual shift. Instead of assuming that the target image is always given, we ask which pieces of information remain available and how they can be used to define a training objective.

The basic difficulty.

Suppose we only observe measurements

\[ \boldsymbol{y}^\delta = K\boldsymbol{x}^\dagger + \boldsymbol{e}, \]

but not the corresponding \(\boldsymbol{x}^\dagger\). If we still want to train a reconstructor \(f_{\boldsymbol{\Theta}}\), the most immediate idea is to enforce consistency with the measured datum:

\[ \min_{\boldsymbol{\Theta}} \frac{1}{N}\sum_{i=1}^N \|Kf_{\boldsymbol{\Theta}}(\boldsymbol{y}_i^\delta)-\boldsymbol{y}_i^\delta\|_2^2. \]

This objective is appealing because it involves only known quantities: the measurement, the forward operator, and the network output.
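As a minimal sketch of this objective, the following snippet evaluates the loss with a hypothetical linear layer standing in for \(f_{\boldsymbol{\Theta}}\) and random toy data; every quantity involved is observable, and the ground truth never appears:

```python
import torch

torch.manual_seed(0)
K = torch.randn(2, 3)        # toy forward operator: more unknowns than measurements
y_batch = torch.randn(4, 2)  # N = 4 toy measurements (illustrative only)

# A stand-in reconstructor f_Theta; any network mapping R^2 -> R^3 would do.
f_theta = torch.nn.Linear(2, 3)

x_hat = f_theta(y_batch)                            # reconstructions from measurements
residual = x_hat @ K.T - y_batch                    # K f(y_i) - y_i, one row per datum
loss = torch.mean(torch.sum(residual ** 2, dim=1))  # (1/N) sum_i ||K f(y_i) - y_i||_2^2
print('self-supervised data-consistency loss:', loss.item())
```

Note that the loss can be driven down without the reconstruction being any good; the next paragraphs explain why.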

However, from an inverse-problems perspective the difficulty is immediate. The set of solutions to

\[ K\boldsymbol{x}=\boldsymbol{y}^\delta \]

is typically not a singleton. If the problem is ill-posed, for instance because \(K\) is rank-deficient or severely ill-conditioned, then many different images are data-consistent. Therefore, pure measurement consistency is not enough to identify a meaningful reconstruction map.

This is the key teaching point: self-supervision never means learning without prior information. It means learning without direct target labels, while obtaining the missing information from another structural source.

import torch

A = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
x1 = torch.tensor([1.0, 2.0, 0.0])
x2 = torch.tensor([0.0, 1.0, 1.0])
y1 = A @ x1
y2 = A @ x2

print('A x1 =', y1.tolist())
print('A x2 =', y2.tolist())
print('x1 and x2 are different:', not torch.allclose(x1, x2))
print('But they are equally data-consistent:', torch.allclose(y1, y2))
A x1 = [1.0, 2.0]
A x2 = [1.0, 2.0]
x1 and x2 are different: True
But they are equally data-consistent: True

Main Self-Supervised Strategies#

Note

A useful teaching principle is that self-supervision does not mean the absence of assumptions. It means that the assumptions are moved from explicit target images into the structure of the loss, the splitting of the data, or the architecture itself.

One of the earliest and most influential examples is the Deep Image Prior (DIP) [16]. Here one does not train a network across a dataset. Instead, for each single observed datum \(\boldsymbol{y}^\delta\), one fixes a random input \(\boldsymbol{z}\) and optimizes the parameters of a network \(g_{\boldsymbol{\Theta}}\) so that

\[ \min_{\boldsymbol{\Theta}} \|Kg_{\boldsymbol{\Theta}}(\boldsymbol{z})-\boldsymbol{y}^\delta\|_2^2. \]

At first sight this seems paradoxical. If no dataset is used and no explicit regularizer is present, where does the prior come from?

The answer is that the architecture itself acts as an implicit prior. Convolutional networks tend to generate structured images before they start fitting fine-scale noise. Thus, early stopping becomes a form of regularization.

This is a very instructive example to discuss in class because it shows that architecture is not only a computational convenience. It can itself encode a bias toward plausible images.

At the same time, DIP has clear drawbacks:

  • one must solve a new optimization problem for each datum;

  • the stopping criterion is delicate;

  • overfitting to noise eventually occurs;

  • the implicit prior is difficult to characterize precisely.

So DIP is both pedagogically valuable and practically limited.

# DIP-style optimization on one blurred Mayo image.
from pathlib import Path

import numpy as np
import torch
from PIL import Image

def course_asset_path(name):
    here = Path.cwd().resolve()
    for base in (here, here.parent, here.parent.parent):
        candidate = base / 'imgs' / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f'Could not locate imgs/{name} from {here}')

torch.manual_seed(0)

img = Image.open(course_asset_path('Mayo.png')).convert('L').resize((48, 48))
x_true = torch.tensor(np.array(img), dtype=torch.float32).unsqueeze(0).unsqueeze(0) / 255.0
kernel = torch.tensor([[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]], dtype=torch.float32)
kernel = (kernel / kernel.sum()).view(1, 1, 3, 3)
y = torch.nn.functional.conv2d(x_true, kernel, padding=1)

model = torch.nn.Sequential(
    torch.nn.Conv2d(8, 32, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 32, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 1, 3, padding=1),
    torch.nn.Sigmoid(),
)
fixed_input = torch.rand(1, 8, 48, 48)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(60):
    x_hat = model(fixed_input)
    data_loss = torch.mean((torch.nn.functional.conv2d(x_hat, kernel, padding=1) - y) ** 2)
    optimizer.zero_grad()
    data_loss.backward()
    optimizer.step()
    if step in [0, 9, 29, 59]:
        recon_error = torch.mean((x_hat.detach() - x_true) ** 2).sqrt().item()
        print(f'Step {step + 1:02d} | data loss = {data_loss.item():.6f} | image RMSE = {recon_error:.6f}')
Step 01 | data loss = 0.146653 | image RMSE = 0.398066
Step 10 | data loss = 0.030320 | image RMSE = 0.184844
Step 30 | data loss = 0.030320 | image RMSE = 0.184844
Step 60 | data loss = 0.030320 | image RMSE = 0.184844

The important thing to observe here is the distinction between the data term and the image error. In a true self-supervised setting, the image error would not be accessible. It is printed here only for teaching purposes, to show what the network is implicitly recovering while optimizing only through the forward model.

Measurement splitting and cross-prediction.

A second family of self-supervised methods starts from the idea that one measurement can sometimes be split into parts that contain complementary information about the same object.

Suppose that the measured datum can be decomposed as

\[ \boldsymbol{y}^\delta = (\boldsymbol{y}_1^\delta,\boldsymbol{y}_2^\delta), \]

with associated forward operators \(K_1\) and \(K_2\). One may then train a network to reconstruct from one subset and predict the other:

\[ \min_{\boldsymbol{\Theta}} \|K_2 f_{\boldsymbol{\Theta}}(\boldsymbol{y}_1^\delta)-\boldsymbol{y}_2^\delta\|^2. \]

Intuitively, the network is prevented from merely copying its input, because the target lives in a complementary measurement subset. This principle underlies methods such as Noise2Inverse and related strategies [8], and is closely related in spirit to self-supervised denoising ideas such as [1].

The success of this idea depends on several assumptions that should be stated explicitly:

  • the measurement can be split in a physically meaningful way;

  • each split retains enough information about the underlying image;

  • the noise in the different splits is independent or sufficiently decorrelated;

  • the forward model relating each split to the unknown object is known accurately enough.

This is a very good point to slow down during the lecture. Students often see the splitting trick and think it is universally applicable. It is not. Its validity depends strongly on acquisition geometry and noise structure.

import torch

torch.manual_seed(0)
x_true = torch.linspace(0.0, 1.0, 8)
noise = 0.05 * torch.randn(8)
y = x_true + noise

odd_measurements = y[::2]
even_measurements = y[1::2]

print('Full measurement vector:', y)
print('Odd-index split:', odd_measurements)
print('Even-index split:', even_measurements)
print('A splitting strategy can ask the model to predict one subset from the other.')
Full measurement vector: tensor([0.0770, 0.1282, 0.1768, 0.4570, 0.5172, 0.6444, 0.8773, 1.0419])
Odd-index split: tensor([0.0770, 0.1768, 0.5172, 0.8773])
Even-index split: tensor([0.1282, 0.4570, 0.6444, 1.0419])
A splitting strategy can ask the model to predict one subset from the other.
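The following sketch closes the loop for the simplest case \(K = I\) (denoising), where \(K_1\) and \(K_2\) merely select the two interleaved subsets above. The tiny network, learning rate, and number of steps are illustrative choices, not recommendations:

```python
import torch

torch.manual_seed(0)
n = 8
x_true = torch.linspace(0.0, 1.0, n)
idx1, idx2 = torch.arange(0, n, 2), torch.arange(1, n, 2)  # the two interleaved splits

# Tiny reconstructor from one split to the full signal (illustrative architecture).
net = torch.nn.Sequential(
    torch.nn.Linear(n // 2, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, n),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(300):
    y = x_true + 0.05 * torch.randn(n)               # fresh noisy acquisition of the same object
    x_hat = net(y[idx1])                             # reconstruct from split 1 ...
    loss = torch.mean((x_hat[idx2] - y[idx2]) ** 2)  # ... and predict split 2: K2 f(y1) vs y2
    opt.zero_grad()
    loss.backward()
    opt.step()

# The RMSE against x_true is, again, printed only for teaching purposes.
y_test = x_true + 0.05 * torch.randn(n)
rmse = torch.mean((net(y_test[idx1]) - x_true) ** 2).sqrt()
print(f'final cross-prediction loss = {loss.item():.4f}, image RMSE = {rmse.item():.4f}')
```

Because the target lives in the complementary split, simply copying the noisy input is not a minimizer; the network is pushed toward the shared underlying signal.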

Equivariance-based self-supervision.

Another useful principle is to exploit symmetries. Suppose that a transformation \(T\) acts on image space and that the corresponding transformation in measurement space is \(S\). Ideally, reconstruction should commute with these transformations:

\[ f_{\boldsymbol{\Theta}}(S\boldsymbol{y}^\delta)\approx T f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta). \]

This motivates a consistency term of the form

\[ \mathcal{L}_{\mathrm{eq}}(\boldsymbol{\Theta}) = \mathbb{E}\big[\|f_{\boldsymbol{\Theta}}(S\boldsymbol{y}^\delta)-Tf_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta)\|^2\big]. \]

The underlying idea is straightforward. Even if we do not know the ground truth, we often know that certain transformations should not alter the reconstruction logic in an arbitrary way.

Equivariance alone is usually not enough to solve the reconstruction problem, but it can be a powerful regularizing ingredient when combined with data-consistency terms.
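As a minimal sketch, assume a denoising setting (\(K = I\)) so that \(T\) and \(S\) coincide and are both a horizontal flip; the tiny network is a hypothetical placeholder for \(f_{\boldsymbol{\Theta}}\):

```python
import torch

torch.manual_seed(0)

def flip(t):
    # T (image space) and S (measurement space) coincide here since K = I.
    return torch.flip(t, dims=[-1])

net = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 8))
y = torch.rand(4, 8)  # a batch of toy measurements

# Equivariance residual: f(S y) should agree with T f(y).
eq_loss = torch.mean((net(flip(y)) - flip(net(y))) ** 2)
print('equivariance consistency loss:', eq_loss.item())

# A genuinely equivariant map (here: the identity) drives this term to zero.
identity_loss = torch.mean((flip(y) - flip(y)) ** 2)
print('loss for the identity map:', identity_loss.item())  # exactly 0.0
```

The second print illustrates the caveat from the text: the identity map is perfectly equivariant yet reconstructs nothing, which is why this term is only useful alongside data consistency.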

Interpretation, Limitations, and Course Connections#

From a probabilistic perspective, supervised training gives direct access to pairs sampled from the joint law of \((\boldsymbol{y}^\delta,\boldsymbol{x}^\dagger)\). Self-supervised training removes the direct access to \(\boldsymbol{x}^\dagger\) and forces us to recover information about the unknown image indirectly.

This means that one must bring in prior information somewhere else. Depending on the method, this prior may come from:

  • the network architecture, as in DIP;

  • the splitting protocol, as in cross-prediction approaches;

  • a symmetry principle, as in equivariant training;

  • an explicit regularizer or pretrained prior.

The following point is the single most important didactic takeaway of this chapter:

Important

Self-supervision does not eliminate the prior. It only changes where the prior enters the problem.

Once students understand this, they can evaluate self-supervised methods more critically and avoid treating them as label-free miracles.

Failure modes.

A careful course should also discuss how these methods can fail.

Null-space collapse.

If the training objective only checks whether the output matches the measurement after application of \(K\), then the network may drift inside the null space of \(K\) and produce artifacts that are invisible to the loss.
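Reusing the toy operator \(A\) from earlier in this chapter (playing the role of \(K\)), a short sketch makes the danger explicit: arbitrarily large drift along the null space changes the image while leaving the data term exactly untouched.

```python
import torch

A = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
v = torch.tensor([1.0, 1.0, -1.0])  # A @ v = 0, so v spans the null space of A
x = torch.tensor([1.0, 2.0, 0.0])

for alpha in [0.0, 5.0, -3.0]:
    x_drift = x + alpha * v         # drift of arbitrary size inside the null space
    data_residual = torch.norm(A @ x_drift - A @ x).item()  # invisible to the loss
    image_change = torch.norm(x_drift - x).item()           # very visible in the image
    print(f'alpha = {alpha:+.0f} | data residual = {data_residual:.1f} | image change = {image_change:.2f}')
```

A data-consistency loss cannot distinguish any of these three candidates, even though they are very different images.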

Identity leakage.

If the task setup allows the network to replicate too much of the noisy measurement, then denoising performance may collapse and the model may simply preserve corruption.

Model mismatch.

If the operator used in the self-supervised loss is not the actual acquisition operator, then the method may optimize the wrong physics. In inverse problems this mismatch can be fatal.

Correlated noise.

Splitting-based methods often assume independent noise across measurement subsets. If this assumption is violated, then the mathematical justification of the training objective weakens substantially.
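A quick numerical check makes this point. Under a hypothetical correlated-noise model in which a shared component contaminates every measurement entry, the noise seen by the two splits is far from independent:

```python
import torch

torch.manual_seed(0)
m = 500
shared = torch.randn(m)             # noise component common to both splits
e1 = shared + 0.1 * torch.randn(m)  # noise seen by split 1
e2 = shared + 0.1 * torch.randn(m)  # noise seen by split 2

# Empirical correlation between the noise realizations of the two splits.
corr = torch.corrcoef(torch.stack([e1, e2]))[0, 1].item()
print(f'empirical correlation between split noises: {corr:.3f}')
```

With correlation this strong, predicting one split from the other partly means predicting the noise itself, so the cross-prediction target no longer isolates the clean signal.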

These failure modes are not merely technical caveats. They are part of the conceptual understanding of self-supervised inverse problems.

How self-supervision connects to the rest of the course.

This chapter should naturally prepare the transition to noise modeling and to the inverse crime. Once one realizes that self-supervised training relies heavily on the forward model and on the statistical structure of the measurements, it becomes clear that poor simulation choices can invalidate the whole approach.

In other words, self-supervision makes the quality of the acquisition model even more central. When the network is not anchored by direct ground-truth labels, every mismatch in physics, noise, or discretization becomes more dangerous.

Summary#

The pedagogical roadmap of this chapter is the following:

  • self-supervised learning is necessary because paired clean targets are often unavailable;

  • pure data-consistency training is insufficient due to ill-posedness;

  • DIP uses architecture as an implicit prior;

  • measurement splitting uses complementary subsets of the same acquisition;

  • equivariance exploits known symmetries of the reconstruction map;

  • every self-supervised method hides a prior, even if that prior is not presented explicitly.

Exercises#

  1. Explain why the loss \(\|K f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta)-\boldsymbol{y}^\delta\|^2\) is generally insufficient on its own.

  2. Describe the source of prior information in Deep Image Prior.

  3. Give one example of a measurement-splitting idea that could be meaningful in an imaging problem, and one example where it would be questionable.

  4. Challenge exercise: compare self-supervised training and the inverse crime. Why does self-supervision make realistic forward modeling even more important?

Further Reading#

This chapter sits exactly at the boundary between inverse-problem modeling and machine learning methodology. To go further, students should compare these ideas with works on Deep Image Prior, Noise2Noise, Noise2Inverse, and equivariant imaging. The main intellectual task is to identify where the missing supervision is being replaced by structural assumptions.