From Supervised Learning to Neural Networks#
Why This Chapter Comes First#
Before discussing CNNs, UNets, transformers, or generative models, it is important to fix the conceptual transition from classical inverse problems to learning-based reconstruction. If this transition is not made carefully, neural networks risk appearing as a completely separate subject, disconnected from the operator-theoretic and variational language of computational imaging. In reality, the connection is very direct.
In Module 1, the basic object of study was the forward model

\[
\boldsymbol{y}^\delta = K \boldsymbol{x}^\dagger + \boldsymbol{e},
\]

where \(\boldsymbol{x}^\dagger \in \mathbb{R}^n\) denotes the unknown image, \(K \in \mathbb{R}^{m \times n}\) is the acquisition operator, and \(\boldsymbol{e} \in \mathbb{R}^m\) models the perturbation due to noise and model mismatch. The reconstruction problem was to recover \(\boldsymbol{x}^\dagger\) from \(\boldsymbol{y}^\delta\) by exploiting knowledge of \(K\), of the noise model, and of suitable regularity assumptions on the unknown image.
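This forward model is easy to simulate. The sketch below uses an arbitrary random matrix as a stand-in for \(K\) (a real acquisition operator would be a blur, a Radon transform, a subsampled Fourier transform, and so on) and Gaussian noise for \(\boldsymbol{e}\); all sizes are illustrative.

```python
import torch

torch.manual_seed(0)

m, n = 30, 40                    # fewer measurements than unknowns: underdetermined
K = torch.randn(m, n) / n**0.5   # toy acquisition operator (stand-in for blur, CT, ...)
x_true = torch.randn(n)          # unknown image, flattened to a vector
noise = 0.05 * torch.randn(m)    # perturbation e
y_delta = K @ x_true + noise     # measured datum y^delta = K x^dagger + e

print(y_delta.shape)             # torch.Size([30])
```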
The learning-based viewpoint starts from exactly the same physical model, but modifies the strategy used to invert it. Instead of solving a new optimization problem for each datum, we collect a training set of examples and learn a parameterized inverse map directly from data.
Learning Formulation and Risk Minimization#
Assume that we have access to a dataset

\[
\left\{ \left(\boldsymbol{y}^\delta_i, \boldsymbol{x}^\dagger_i\right) \right\}_{i=1}^{N},
\]

where each pair consists of a measured datum and the corresponding ground-truth image. From a mathematical viewpoint, these pairs are sampled from an unknown probability distribution \(\mathbb{P}\) on the product space of data and images. The goal is to construct a function

\[
f_{\boldsymbol{\Theta}} : \mathbb{R}^m \to \mathbb{R}^n,
\]

depending on parameters \(\boldsymbol{\Theta}\), such that \(f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta)\) is a good approximation of \(\boldsymbol{x}^\dagger\) whenever \((\boldsymbol{y}^\delta, \boldsymbol{x}^\dagger)\) follows the same law as the training data.
This is already a useful point to stress in class: a supervised neural reconstructor is not learning “the inverse of \(K\)” in a purely algebraic sense. It is learning an inverse map relative to a data distribution. In other words, it is trying to invert the forward model only on the subset of images that are considered plausible according to the training set.
This observation has two immediate consequences:
- the quality of the training distribution is as important as the quality of the architecture;
- a network that performs excellently on one image class may fail badly on another, even if the forward operator is the same.
Population risk and empirical risk.
To define training mathematically, introduce a loss function

\[
\ell : \mathbb{R}^n \times \mathbb{R}^n \to [0,\infty),
\]

where \(\ell(\widehat{\boldsymbol{x}},\boldsymbol{x}^\dagger)\) measures the discrepancy between a reconstruction \(\widehat{\boldsymbol{x}}\) and the target image \(\boldsymbol{x}^\dagger\). The ideal objective is the population risk

\[
\mathcal{L}(\boldsymbol{\Theta}) = \mathbb{E}_{(\boldsymbol{y}^\delta, \boldsymbol{x}^\dagger) \sim \mathbb{P}} \left[ \ell\big(f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta), \boldsymbol{x}^\dagger\big) \right].
\]

Of course, the distribution \(\mathbb{P}\) is unknown, so one replaces it by the empirical distribution of the dataset and minimizes the empirical risk

\[
\widehat{\mathcal{L}}(\boldsymbol{\Theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta_i), \boldsymbol{x}^\dagger_i\big).
\]
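In code, the empirical risk is nothing more than an average of per-example losses. The sketch below evaluates it under the squared loss for a randomly initialized linear reconstructor; the dataset, the operator, and the choice of `torch.nn.Linear` as \(f_{\boldsymbol{\Theta}}\) are all illustrative.

```python
import torch

torch.manual_seed(0)

m, n, N = 30, 40, 8
K = torch.randn(m, n) / n**0.5                      # toy forward operator
x_true = torch.randn(N, n)                          # ground-truth images
y_delta = x_true @ K.T + 0.05 * torch.randn(N, m)   # measured data

f_theta = torch.nn.Linear(m, n)                     # a naive parameterized inverse map


def empirical_risk(f, y, x):
    # (1/N) * sum_i || f(y_i) - x_i ||^2, i.e. the empirical risk under squared loss
    return torch.mean(torch.sum((f(y) - x) ** 2, dim=1))


print(empirical_risk(f_theta, y_delta, x_true).item())
```

Training then amounts to driving this scalar down by adjusting the parameters of `f_theta`.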
This is the standard framework of supervised learning, but in the inverse-problems setting the interpretation is especially rich. A broad machine learning background for this viewpoint can be found in [5]. The map \(f_{\boldsymbol{\Theta}}\) is not an arbitrary predictor. It is a learned regularized inverse. The training procedure tries to identify, inside a chosen model class, the map that best reconstructs typical images from typical measurements.
What the loss really measures.
At first sight, the choice of loss may seem secondary. In practice it is central, because it determines what kind of estimator the network is pushed to approximate.
If one uses the squared error

\[
\ell(\widehat{\boldsymbol{x}}, \boldsymbol{x}^\dagger) = \left\| \widehat{\boldsymbol{x}} - \boldsymbol{x}^\dagger \right\|_2^2,
\]

then the corresponding Bayes-optimal estimator is the conditional expectation

\[
f^\star(\boldsymbol{y}^\delta) = \mathbb{E}\left[ \boldsymbol{x}^\dagger \,\middle|\, \boldsymbol{y}^\delta \right].
\]
This means that MSE training pushes the network toward a posterior mean estimator. This is mathematically elegant, but it also explains a recurring phenomenon in imaging: if several reconstructions are compatible with the same data, then their average may be visually smoother than any actual plausible image. Hence the well-known tendency of MSE-trained models to blur high-frequency detail.
If instead one uses an \(\ell_1\) loss, the target becomes closer to a conditional median estimator. If one adds perceptual or adversarial terms, then one departs even further from a simple point estimator and starts favoring reconstructions that look realistic in a distributional sense.
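The mean-versus-median distinction can be checked numerically on a scalar toy "posterior" over a single pixel value (the sample values below are arbitrary): the minimizer of the average squared loss is the sample mean, pulled toward the outlier, while the minimizer of the average absolute loss stays at the median.

```python
import torch

samples = torch.tensor([0.0, 0.1, 0.2, 5.0])  # skewed set of plausible pixel values

# Brute-force search for the point estimate minimizing each average loss.
grid = torch.linspace(-1.0, 6.0, 7001)
mse = ((grid[:, None] - samples) ** 2).mean(dim=1)
mae = (grid[:, None] - samples).abs().mean(dim=1)

best_mse = grid[mse.argmin()].item()
best_mae = grid[mae.argmin()].item()
print(f'L2-optimal point: {best_mse:.3f} (sample mean = {samples.mean().item():.3f})')
print(f'L1-optimal point: {best_mae:.3f} (any median in [0.1, 0.2])')
```

The \(\ell_2\)-optimal point sits between the cluster and the outlier, exactly the averaging effect that blurs detail in images; the \(\ell_1\)-optimal point does not.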
Therefore, even before choosing the architecture, one should tell the students that the loss function is part of the inverse model. It encodes what is considered a good answer.
From Linear Predictors to Deep Nonlinear Models#
A natural starting point is the affine model

\[
f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta) = W \boldsymbol{y}^\delta + \boldsymbol{b},
\]

with \(W\in\mathbb{R}^{n\times m}\) and \(\boldsymbol{b}\in\mathbb{R}^n\). Such a model is attractive because it is easy to analyze, easy to optimize, and immediately connected to familiar linear inverse methods. However, it is limited in several decisive ways.
First, the map from measured data to stable reconstructions is usually not linear, even when the forward model is linear. Classical regularization already demonstrates this. For instance, the Tikhonov estimator

\[
\widehat{\boldsymbol{x}}_\alpha = \left( K^\top K + \alpha I \right)^{-1} K^\top \boldsymbol{y}^\delta
\]

depends on the regularization parameter, on the prior model, and on the noise level. If these ingredients are adapted to the observed datum, one immediately leaves the realm of simple fixed linear inversion.
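The closed-form Tikhonov estimator takes a few lines to sketch; the random operator and the grid of \(\alpha\) values below are illustrative. Each individual estimator is linear in the data, but the map from datum to a well-chosen reconstruction is not, because a good \(\alpha\) depends on the noise level.

```python
import torch

torch.manual_seed(0)

m, n = 30, 40
K = torch.randn(m, n) / n**0.5
x_true = torch.randn(n)
y_delta = K @ x_true + 0.05 * torch.randn(m)


def tikhonov(y, K, alpha):
    # x_alpha = (K^T K + alpha I)^{-1} K^T y
    n = K.shape[1]
    A = K.T @ K + alpha * torch.eye(n)
    return torch.linalg.solve(A, K.T @ y)


for alpha in (1e-3, 1e-1, 1e1):
    x_hat = tikhonov(y_delta, K, alpha)
    print(f'alpha={alpha:g}: reconstruction error {torch.norm(x_hat - x_true).item():.3f}')
```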
Second, even when a linear estimator is mathematically admissible, it may be too rigid to capture the structure of natural images. Images are not arbitrary vectors. They exhibit local patterns, edges, textures, repeated motifs, and large-scale organization. A dense affine map treats all coordinates as generic features and does not encode any of this structure.
Third, the parameter count is prohibitive. Mapping an image-sized datum to an image-sized output through a full matrix is often infeasible. This already suggests the need for structured parameterizations.
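To make the parameter count concrete: for a modest 256×256 image and a measurement vector of the same size, a single dense affine map already requires billions of parameters.

```python
# Parameter count of one dense affine layer mapping a 256x256 datum to a 256x256 image.
n = 256 * 256        # pixels per image
weights = n * n      # entries of the dense matrix W
biases = n           # entries of the bias b
total = weights + biases
print(f'{total:,} parameters')  # 4,295,032,832 parameters
```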
```python
import torch

torch.manual_seed(0)

# Toy 1D regression task: the target is a quadratic, which no affine map can fit.
x = torch.linspace(-1.0, 1.0, 41).unsqueeze(1)
y = x**2 + 0.10 * x

# Affine baseline fitted by least squares.
X = torch.cat([x, torch.ones_like(x)], dim=1)
theta = torch.linalg.lstsq(X, y).solution
y_affine = X @ theta
affine_mse = torch.mean((y_affine - y) ** 2).item()

# Tiny one-hidden-layer ReLU network trained on the same data.
mlp = torch.nn.Sequential(
    torch.nn.Linear(1, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=5e-2)
for _ in range(600):
    pred = mlp(x)
    loss = torch.mean((pred - y) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    y_mlp = mlp(x)
mlp_mse = torch.mean((y_mlp - y) ** 2).item()

print(f'Affine model MSE: {affine_mse:.6f}')
print(f'Tiny nonlinear network MSE: {mlp_mse:.6f}')
print()
print('Sample predictions:')
for idx in [0, 10, 20, 30, 40]:
    print(
        f'x={x[idx].item():+.2f} | target={y[idx].item():+.3f} '
        f'| affine={y_affine[idx].item():+.3f} | mlp={y_mlp[idx].item():+.3f}'
    )
```
```text
Affine model MSE: 0.097825
Tiny nonlinear network MSE: 0.000165

Sample predictions:
x=-1.00 | target=+0.900 | affine=+0.250 | mlp=+0.876
x=-0.50 | target=+0.200 | affine=+0.300 | mlp=+0.209
x=-0.00 | target=-0.000 | affine=+0.350 | mlp=+0.017
x=+0.50 | target=+0.300 | affine=+0.400 | mlp=+0.315
x=+1.00 | target=+1.100 | affine=+0.450 | mlp=+1.088
```
Why depth alone does not help.
At this point students often ask a very reasonable question: why not simply compose several affine maps? The answer is an excellent pedagogical opportunity, because it isolates the real source of expressive power in neural networks.
Let

\[
f_1(\boldsymbol{z}) = W_1 \boldsymbol{z} + \boldsymbol{b}_1,
\qquad
f_2(\boldsymbol{z}) = W_2 \boldsymbol{z} + \boldsymbol{b}_2.
\]

Then

\[
(f_2 \circ f_1)(\boldsymbol{z}) = W_2 W_1 \boldsymbol{z} + W_2 \boldsymbol{b}_1 + \boldsymbol{b}_2,
\]

which is again affine. Thus, stacking linear maps without additional nonlinear operations does not enlarge the model class in any essential way.
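This collapse can be verified numerically: composing two random affine maps agrees with a single affine map whose weight is \(W_2 W_1\) and whose bias is \(W_2 \boldsymbol{b}_1 + \boldsymbol{b}_2\). The dimensions below are arbitrary.

```python
import torch

torch.manual_seed(0)

W1, b1 = torch.randn(5, 3), torch.randn(5)
W2, b2 = torch.randn(4, 5), torch.randn(4)

z = torch.randn(3)
composed = W2 @ (W1 @ z + b1) + b2           # f2(f1(z))
collapsed = (W2 @ W1) @ z + (W2 @ b1 + b2)   # one equivalent affine map

print(torch.allclose(composed, collapsed, atol=1e-5))  # True
```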
This is one of the cleanest moments in the course to explain why neural networks require nonlinear activation functions. Depth matters, but only once linearity has been broken.
The role of nonlinear activations.
Let \(\rho:\mathbb{R}\to\mathbb{R}\) be a scalar nonlinear function, applied componentwise to vectors. A simple one-hidden-layer network has the form

\[
f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta) = W_2 \, \rho\!\left( W_1 \boldsymbol{y}^\delta + \boldsymbol{b}_1 \right) + \boldsymbol{b}_2.
\]
Typical activations include:

- ReLU, \(\rho(t)=\max\{t,0\}\);
- leaky ReLU, \(\rho(t)=\max\{\alpha t,t\}\) with a small slope \(\alpha\in(0,1)\);
- GELU, often used in transformer-based architectures.
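All three are available in PyTorch; evaluating them on a few sample values shows how each treats negative inputs differently (zeroed, scaled by the small slope, or smoothly suppressed).

```python
import torch

t = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = torch.relu(t)
leaky = torch.nn.functional.leaky_relu(t, negative_slope=0.1)
gelu = torch.nn.functional.gelu(t)

print('ReLU:      ', relu.tolist())
print('leaky ReLU:', leaky.tolist())
print('GELU:      ', [round(v, 3) for v in gelu.tolist()])
```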
The role of the activation is deeper than simply introducing a nonlinearity. It changes the geometry of the function class. A ReLU network, for example, represents a piecewise affine map, but with a very large number of affine regions. This already gives a hint of why such networks can approximate highly complex relations between measurements and images.
```python
# Display a local course figure for the basic neural network architecture.
from pathlib import Path

from IPython.display import display
from PIL import Image


def course_asset_path(name):
    # Walk up from the working directory until an 'imgs' folder with the asset is found.
    here = Path.cwd().resolve()
    for base in (here, here.parent, here.parent.parent):
        candidate = base / 'imgs' / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f'Could not locate imgs/{name} from {here}')


display(Image.open(course_asset_path('NN.png')).resize((900, 760)))
```
Warning.
A common misunderstanding is to think that depth alone creates expressivity. Without nonlinear activation functions, composing affine maps still produces an affine map. This is why the activation is a structural ingredient, not an optional embellishment.
Neural networks as learned regularizers.
A very useful bridge with classical inverse problems is the following interpretation. Even if a neural network is trained end-to-end, one can still view it as approximating the solution map of an optimization problem of the form

\[
\widehat{\boldsymbol{x}} \in \operatorname*{arg\,min}_{\boldsymbol{x}\in\mathbb{R}^n} \; \mathcal{D}\big(K\boldsymbol{x}, \boldsymbol{y}^\delta\big) + \mathcal{R}(\boldsymbol{x}),
\]

where \(\mathcal{D}\) is a data-fidelity term and \(\mathcal{R}\) is a regularizer.
The network does not necessarily make \(\mathcal{D}\) and \(\mathcal{R}\) explicit, but it behaves as if it had absorbed them into a learned nonlinear reconstruction rule. This is why it is sensible to say that a trained network is a learned regularizer or a learned inverse map. It does not replace the inverse-problems viewpoint. It instantiates it in a different way.
Depth, features, and internal representations.
For a deep network we write

\[
\boldsymbol{h}_0 = \boldsymbol{y}^\delta,
\qquad
\boldsymbol{h}_\ell = \rho\!\left( W_\ell \boldsymbol{h}_{\ell-1} + \boldsymbol{b}_\ell \right),
\quad \ell = 1,\dots,L-1,
\qquad
f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta) = W_L \boldsymbol{h}_{L-1} + \boldsymbol{b}_L.
\]

The vectors \(\boldsymbol{h}_\ell\) are the internal features produced by the network. At this stage of the course it is useful to discuss what “feature” means concretely in imaging.
In the early layers, one often finds detectors of simple local events such as oriented edges, contrast changes, or simple repeated textures. In the middle layers, these elementary features are combined into larger structures. In deeper layers, the representation becomes more task-specific and more global. The network is therefore not only computing a final answer. It is building a hierarchy of descriptions of the datum.
This hierarchical viewpoint is one of the reasons depth is so important. A shallow model may in principle approximate the same function, but the parameter efficiency and representational organization can be dramatically worse.
Universal approximation is not the end of the story.
The universal approximation theorem is often cited when introducing neural networks. It states, roughly speaking, that a one-hidden-layer network with a non-polynomial activation can approximate any continuous function on a compact set to arbitrary accuracy, provided the hidden layer is wide enough. This result is mathematically important because it proves that nonlinear networks are not fundamentally limited in the way linear models are.
However, this theorem should not be overemphasized in a computational imaging course. In practice, one does not only care about approximation in an abstract sense. One also cares about:
- how many parameters are needed;
- whether the model can be trained stably;
- how the architecture reflects known properties of images;
- whether the reconstructor is robust to noise and operator mismatch.
In other words, the relevant question is not only can a neural network represent the desired inverse map, but rather which neural network architecture represents it efficiently and stably.
This is the precise point at which the course naturally moves from generic neural networks to image-specific architectures.
Optimization, Backpropagation, and Generalization#
Training a neural network means solving the nonconvex optimization problem

\[
\min_{\boldsymbol{\Theta}} \; \widehat{\mathcal{L}}(\boldsymbol{\Theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_{\boldsymbol{\Theta}}(\boldsymbol{y}^\delta_i), \boldsymbol{x}^\dagger_i\big).
\]

The main computational tool is backpropagation, which is simply an efficient implementation of the chain rule through the computational graph. If the training loss is denoted by \(L(\boldsymbol{\Theta})\), then the gradient descent update reads

\[
\boldsymbol{\Theta}_{k+1} = \boldsymbol{\Theta}_k - \eta \, \nabla L(\boldsymbol{\Theta}_k).
\]

In practice one rarely computes the full empirical gradient. Instead one uses minibatches and obtains a stochastic approximation:

\[
\boldsymbol{\Theta}_{k+1} = \boldsymbol{\Theta}_k - \eta \, \frac{1}{|B_k|} \sum_{i \in B_k} \nabla_{\boldsymbol{\Theta}} \, \ell\big(f_{\boldsymbol{\Theta}_k}(\boldsymbol{y}^\delta_i), \boldsymbol{x}^\dagger_i\big),
\]

where \(B_k\) denotes the current batch.
This point should be connected to numerical optimization from Module 1. Neural-network training is still numerical optimization. The novelty is not the existence of optimization, but the scale, nonconvexity, and parameterization of the problem.
```python
import torch

# A single affine neuron: prediction = w * x + b.
layer = torch.nn.Linear(1, 1)
x = torch.tensor([[2.0]])
target = torch.tensor([[1.0]])

prediction = layer(x)
loss = torch.mean((prediction - target) ** 2)
loss.backward()  # backpropagation: the chain rule through the computational graph

print('Prediction:', prediction.item())
print('Loss:', loss.item())
print('Gradient with respect to the weight:', layer.weight.grad.item())
print('Gradient with respect to the bias:', layer.bias.grad.item())
```
```text
Prediction: 0.5673959255218506
Loss: 0.18714629113674164
Gradient with respect to the weight: -1.7304162979125977
Gradient with respect to the bias: -0.8652081489562988
```
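The same gradients drive the minibatch update written above. A minimal hand-rolled training loop over shuffled batches might look as follows; the dataset, batch size, learning rate, and the linear ground-truth map are all illustrative choices.

```python
import torch

torch.manual_seed(0)

# Toy supervised pairs (y_i, x_i); all sizes here are illustrative.
N, m, n = 64, 10, 10
Y = torch.randn(N, m)         # "measurements"
X = Y @ torch.randn(m, n)     # "images" produced by an unknown linear map

model = torch.nn.Linear(m, n)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

init_loss = torch.mean((model(Y) - X) ** 2).item()

batch_size = 8
for epoch in range(200):
    perm = torch.randperm(N)                  # reshuffle the dataset each epoch
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]  # current batch B_k
        loss = torch.mean((model(Y[idx]) - X[idx]) ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

final_loss = torch.mean((model(Y) - X) ** 2).item()
print(f'Loss before training: {init_loss:.4f}')
print(f'Loss after training:  {final_loss:.6f}')
```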
Statistical generalization.
Another key topic to explain explicitly is generalization. A network may minimize the training loss very well and still fail on new data. In the inverse-problems setting this issue is even more delicate because the test datum may differ from training not only in image content, but also in:
- noise level;
- acquisition geometry;
- calibration;
- discretization;
- anatomical class or object class.
Thus, when we say that a network generalizes, we mean that the learned inverse map remains reliable under realistic variation in both the unknown image and the measurement process.
This is why one should teach students very early that reconstruction quality cannot be separated from dataset design. A model trained on unrealistic synthetic pairs may generalize only inside a narrow artificial world.
Why the next step must be image-specific architectures.
At this point the motivation for image-oriented architectures is clear. Fully connected networks are conceptually useful but computationally and statistically inefficient for images. They do not respect locality, do not exploit translational patterns, and ignore the geometry of the image grid.
This motivates the next chapter. Once the idea of a learned inverse map is clear, one asks how to design the architecture so that it matches the structure of image data. The answers will be convolutions, multiscale encoders and decoders, skip connections, and attention mechanisms.
Summary#
This chapter should leave the students with a precise conceptual map:
- supervised reconstruction is empirical risk minimization over parameterized inverse maps;
- the loss function determines the statistical target of the estimator;
- affine models are too limited for realistic imaging tasks;
- nonlinear activations make expressive inverse modeling possible;
- neural networks can be interpreted as learned regularized reconstructors;
- the structure of images demands specialized architectures rather than generic dense networks.
Exercises#
1. Show explicitly that the composition of three affine maps is still affine.
2. Explain in your own words why the loss function changes the statistical estimator learned by the network.
3. Give one example of an inverse problem where a linear estimator is mathematically natural, and explain why it may still be inadequate in practice.
4. Discuss why image geometry suggests using structured architectures instead of generic fully connected networks.
Further Reading#
For a deeper background on neural networks in a broader machine learning setting, students should compare these notes with a standard deep learning text and with a more classical statistical learning perspective. When revising this chapter, it is especially useful to keep asking which statements are about approximation power, which are about optimization, and which are about statistical generalization.
A good way to study this material is to revisit each formula and ask how it changes once the input is an image rather than a generic vector. That question leads naturally to the architectural material of the next chapter.