Variational Autoencoders and GANs#

Latent Variable Modeling and Autoencoding#

Up to this point the main object of study has been a discriminative reconstructor, namely a map that takes a measured datum and outputs a reconstructed image. This is a natural place to start, but it has a limitation. It typically returns a single answer, while inverse problems are often ambiguous. Several different images may be consistent with the same measurements, especially when the forward operator is ill-conditioned or undersampled.

This motivates a broader question: instead of learning only how to map data to images, can we learn the image distribution itself? If we can do this, then we gain access to a learned image prior. This prior can later be combined with the data-consistency information coming from the forward model.

The rest of the course therefore shifts from discriminative learning to generative modeling. Historically, the first major deep generative paradigms were variational autoencoders and generative adversarial networks. They are different in philosophy, in mathematical formulation, and in the kind of prior they induce.

Latent variables and the low-dimensional hypothesis.

Both VAEs and GANs are built around a common idea: high-dimensional images may be generated from lower-dimensional latent variables. If \(\boldsymbol{x} \in \mathbb{R}^n\) is an image and \(\boldsymbol{z} \in \mathbb{R}^k\) is a latent code with \(k \ll n\), then one posits a generator or decoder map

\[ G_{\boldsymbol{\Theta}} : \mathbb{R}^k \to \mathbb{R}^n, \qquad \boldsymbol{x} \approx G_{\boldsymbol{\Theta}}(\boldsymbol{z}). \]

This expresses the belief that realistic images occupy only a tiny, structured region inside the ambient high-dimensional space. Inverse problems benefit enormously from this viewpoint, because restricting the search space from \(\mathbb{R}^n\) to the range of a generator can act as a powerful regularizer.

Autoencoders as the starting point.

Before introducing VAEs, it is useful to recall the ordinary autoencoder. An autoencoder consists of:

  • an encoder \(E_\phi\) that maps an image \(\boldsymbol{x}\) to a latent representation \(\boldsymbol{z}\);

  • a decoder \(G_{\boldsymbol{\Theta}}\) that maps the latent representation back to an approximate reconstruction.

The training problem is usually

\[ \min_{\phi,\boldsymbol{\Theta}} \mathbb{E}\big[\|\boldsymbol{x}-G_{\boldsymbol{\Theta}}(E_\phi(\boldsymbol{x}))\|^2\big]. \]

This architecture learns a compressed representation of the data, but by itself it does not define a full probabilistic generative model. In particular, it does not tell us how latent codes should be sampled in order to generate new images.

This is the point where the VAE enters [11].

# Tiny autoencoder on local image patches.
from pathlib import Path

import numpy as np
import torch
from PIL import Image

def course_asset_path(name):
    here = Path.cwd().resolve()
    for base in (here, here.parent, here.parent.parent):
        candidate = base / 'imgs' / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f'Could not locate imgs/{name} from {here}')

torch.manual_seed(0)

img = Image.open(course_asset_path('GoPro.jpg')).convert('L').resize((48, 48))
x = torch.tensor(np.array(img), dtype=torch.float32) / 255.0
patches = []
for i in range(0, 32, 8):
    for j in range(0, 32, 8):
        patches.append(x[i:i+16, j:j+16].reshape(-1))
data = torch.stack(patches)

latent_dim = 16
encoder = torch.nn.Sequential(torch.nn.Linear(256, 64), torch.nn.ReLU(), torch.nn.Linear(64, latent_dim))
decoder = torch.nn.Sequential(torch.nn.Linear(latent_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, 256), torch.nn.Sigmoid())
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-2)

for epoch in range(200):
    z = encoder(data)
    recon = decoder(z)
    loss = torch.mean((recon - data) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch in [0, 19, 99, 199]:
        print(f'Epoch {epoch + 1:03d} | reconstruction loss = {loss.item():.6f}')
Epoch 001 | reconstruction loss = 0.059913
Epoch 020 | reconstruction loss = 0.022438
Epoch 100 | reconstruction loss = 0.002200
Epoch 200 | reconstruction loss = 0.000236

This example is deliberately small, but it shows the core autoencoder mechanism: an image patch is compressed into a low-dimensional latent representation and then decoded again. At this stage there is still no probabilistic latent model, only deterministic compression and reconstruction.
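The missing ingredient can be made visible directly. The following sketch (synthetic low-dimensional data, not the patch example above) trains a plain autoencoder and then inspects the statistics of its latent codes: nothing in the reconstruction objective ties them to any fixed sampling distribution, so there is no principled way to draw new codes.

```python
import torch

torch.manual_seed(0)

# Synthetic data concentrated far from the origin.
data = torch.randn(512, 4) * 0.1 + 5.0

encoder = torch.nn.Linear(4, 2)
decoder = torch.nn.Linear(2, 4)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

for _ in range(500):
    z = encoder(data)
    loss = torch.mean((decoder(z) - data) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    z = encoder(data)
    # The latent statistics are whatever training happened to produce;
    # they need not resemble N(0, I), so decoding z ~ N(0, I) is unjustified.
    print('Latent mean per dimension:', z.mean(dim=0))
    print('Latent std per dimension :', z.std(dim=0))
```

The printed mean and standard deviation are arbitrary byproducts of initialization and training, which is exactly the gap the VAE closes by regularizing the latent distribution toward a known prior.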

Variational Autoencoders as Probabilistic Generative Models#

A VAE defines a probabilistic latent-variable model

\[ p_{\boldsymbol{\Theta}}(\boldsymbol{x},\boldsymbol{z})=p_{\boldsymbol{\Theta}}(\boldsymbol{x}| \boldsymbol{z})p(\boldsymbol{z}), \]

where the prior on the latent variable is often chosen as

\[ p(\boldsymbol{z})=\mathcal{N}(0,I). \]

The induced model for the image is obtained by marginalization:

\[ p_{\boldsymbol{\Theta}}(\boldsymbol{x})=\int p_{\boldsymbol{\Theta}}(\boldsymbol{x}| \boldsymbol{z})p(\boldsymbol{z})\,d\boldsymbol{z}. \]

This is the key point. The VAE is not merely compressing images. It is trying to assign a probability law to them through latent variables.

# Tiny VAE on the same local patch collection.
from pathlib import Path

import numpy as np
import torch
from PIL import Image

def course_asset_path(name):
    here = Path.cwd().resolve()
    for base in (here, here.parent, here.parent.parent):
        candidate = base / 'imgs' / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f'Could not locate imgs/{name} from {here}')

torch.manual_seed(0)

img = Image.open(course_asset_path('GoPro.jpg')).convert('L').resize((48, 48))
x = torch.tensor(np.array(img), dtype=torch.float32) / 255.0
patches = []
for i in range(0, 32, 8):
    for j in range(0, 32, 8):
        patches.append(x[i:i+16, j:j+16].reshape(-1))
data = torch.stack(patches)

latent_dim = 8
encoder = torch.nn.Sequential(torch.nn.Linear(256, 64), torch.nn.ReLU())
mu_head = torch.nn.Linear(64, latent_dim)
logvar_head = torch.nn.Linear(64, latent_dim)
decoder = torch.nn.Sequential(torch.nn.Linear(latent_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, 256), torch.nn.Sigmoid())
params = list(encoder.parameters()) + list(mu_head.parameters()) + list(logvar_head.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=5e-3)

for epoch in range(250):
    h = encoder(data)
    mu = mu_head(h)
    logvar = logvar_head(h)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    recon = decoder(z)
    recon_loss = torch.mean((recon - data) ** 2)
    kl = 0.5 * torch.mean(torch.exp(logvar) + mu ** 2 - 1.0 - logvar)
    loss = recon_loss + 0.05 * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch in [0, 24, 99, 249]:
        print(f'Epoch {epoch + 1:03d} | recon = {recon_loss.item():.6f} | KL = {kl.item():.6f}')
Epoch 001 | recon = 0.064360 | KL = 0.031250
Epoch 025 | recon = 0.052795 | KL = 0.002086
Epoch 100 | recon = 0.050103 | KL = 0.011786
Epoch 250 | recon = 0.044030 | KL = 0.081561

Compared with the deterministic autoencoder, the VAE now pays an explicit KL price for keeping the latent distribution close to a Gaussian prior. This makes sampling possible, but it also explains why VAE reconstructions tend to be more regularized and often smoother.

Why the exact likelihood is difficult.

The log-likelihood

\[ \log p_{\boldsymbol{\Theta}}(\boldsymbol{x}) = \log \int p_{\boldsymbol{\Theta}}(\boldsymbol{x}| \boldsymbol{z})p(\boldsymbol{z})\,d\boldsymbol{z} \]

is generally intractable because the integral over latent space cannot be computed exactly for a complex neural decoder. The VAE solves this by introducing an approximate posterior distribution

\[ q_\phi(\boldsymbol{z}| \boldsymbol{x}), \]

often called the encoder distribution, and then deriving a tractable lower bound on the log-likelihood.
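To see the difficulty concretely, one can estimate the marginal likelihood by naive Monte Carlo, averaging \(p_{\boldsymbol{\Theta}}(\boldsymbol{x}|\boldsymbol{z}_i)\) over samples \(\boldsymbol{z}_i \sim p(\boldsymbol{z})\). The sketch below uses a toy linear-Gaussian "decoder" (an assumption made here only so that \(p(\boldsymbol{x}|\boldsymbol{z})\) has a closed form); even in this easy setting the estimate fluctuates strongly with the sample count, and for a deep decoder in high dimensions the variance becomes prohibitive.

```python
import math
import torch

torch.manual_seed(0)

n, k = 16, 2
A = torch.randn(n, k)                 # toy linear "decoder": x | z ~ N(A z, sigma^2 I)
sigma = 0.5
x = A @ torch.randn(k)                # one observation drawn from the model

for num_samples in [10, 100, 10000]:
    z = torch.randn(num_samples, k)   # z_i ~ p(z) = N(0, I)
    means = z @ A.T                   # decoder output for each latent sample
    log_terms = (-0.5 * ((x - means) ** 2).sum(dim=1) / sigma**2
                 - 0.5 * n * math.log(2 * math.pi * sigma**2))
    # log p(x) ~= log (1/N) sum_i p(x | z_i), computed stably via logsumexp.
    log_px = torch.logsumexp(log_terms, dim=0) - math.log(num_samples)
    print(f'{num_samples:6d} samples | naive MC estimate of log p(x) = {log_px.item():.3f}')
```

The variational posterior \(q_\phi(\boldsymbol{z}|\boldsymbol{x})\) can be understood as a learned proposal distribution that concentrates on the latent codes that actually explain \(\boldsymbol{x}\), replacing this wasteful prior sampling.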

The ELBO.

The central identity of the VAE is

\[ \log p_{\boldsymbol{\Theta}}(\boldsymbol{x}) \geq \mathbb{E}_{q_\phi(\boldsymbol{z}| \boldsymbol{x})}[\log p_{\boldsymbol{\Theta}}(\boldsymbol{x}| \boldsymbol{z})] -\operatorname{KL}\big(q_\phi(\boldsymbol{z}| \boldsymbol{x})\,\|\,p(\boldsymbol{z})\big). \]

The right-hand side is the evidence lower bound, or ELBO. It has a very instructive decomposition.

The first term is the reconstruction term. It asks the decoder to explain the observed image well when the latent variable is sampled from the encoder distribution.

The second term is the KL regularization term. It pushes the encoder posterior toward the prior \(p(\boldsymbol{z})\). This is what makes latent sampling possible and gives the model genuine generative semantics.

import torch

mu = torch.tensor([0.2, -0.4, 0.1])
log_var = torch.tensor([-0.7, 0.0, 0.3])
reconstruction_error = torch.tensor(0.85)

kl = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1.0 - log_var)
elbo = -(reconstruction_error + kl)

print('Toy reconstruction term:', float(reconstruction_error))
print('Toy KL term:', float(kl))
print('Toy ELBO:', float(elbo))
Toy reconstruction term: 0.8500000238418579
Toy KL term: 0.2282220721244812
Toy ELBO: -1.0782220363616943

Interpreting the Gaussian decoder.

If the conditional model is chosen as

\[ p_{\boldsymbol{\Theta}}(\boldsymbol{x}| \boldsymbol{z})=\mathcal{N}(\mu_{\boldsymbol{\Theta}}(\boldsymbol{z}),\sigma^2I), \]

then maximizing the reconstruction term is equivalent, up to constants, to minimizing

\[ \|\boldsymbol{x}-\mu_{\boldsymbol{\Theta}}(\boldsymbol{z})\|_2^2. \]

This is one of the reasons VAEs are often associated with smooth reconstructions. The Gaussian decoder and the averaged nature of the likelihood term favor mean-like outputs.

This is not a flaw of implementation. It is a direct consequence of the probabilistic assumptions built into the model.
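This equivalence can be checked numerically. In the sketch below, \(\boldsymbol{x}\) and the decoder output \(\mu_{\boldsymbol{\Theta}}(\boldsymbol{z})\) are arbitrary toy values; the point is that the Gaussian negative log-likelihood splits into a scaled squared error plus a constant that does not depend on the decoder output.

```python
import math
import torch

sigma = 0.3
x = torch.tensor([0.2, 0.7, 0.5])      # observed vector (toy values)
mu = torch.tensor([0.25, 0.6, 0.55])   # decoder output mu_Theta(z) (toy values)

d = x.numel()
# -log N(x; mu, sigma^2 I) = ||x - mu||^2 / (2 sigma^2) + (d/2) log(2 pi sigma^2)
nll = 0.5 * torch.sum((x - mu) ** 2) / sigma**2 + 0.5 * d * math.log(2 * math.pi * sigma**2)
scaled_mse = 0.5 * torch.sum((x - mu) ** 2) / sigma**2
constant = 0.5 * d * math.log(2 * math.pi * sigma**2)

print('Negative log-likelihood:', float(nll))
print('Scaled squared error   :', float(scaled_mse))
print('Constant offset        :', float(constant))
```

Since the constant is independent of \(\boldsymbol{\Theta}\), maximizing the reconstruction term and minimizing the squared error select the same decoder.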

The reparameterization trick.

There is one additional ingredient that deserves explicit explanation in teaching: how does one differentiate through the random latent variable? The standard answer is the reparameterization trick. If the encoder outputs a mean \(\mu_\phi(\boldsymbol{x})\) and a standard deviation \(\sigma_\phi(\boldsymbol{x})\), then one writes

\[ \boldsymbol{z} = \mu_\phi(\boldsymbol{x}) + \sigma_\phi(\boldsymbol{x})\odot \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(0,I). \]

This converts sampling into a deterministic function of the parameters and an auxiliary random variable. Backpropagation can then proceed normally.

This is a good example of how probability and optimization interact in deep learning.

import torch

torch.manual_seed(0)
mu = torch.tensor([1.0, -1.0])
std = torch.tensor([0.5, 0.25])

samples = []
for _ in range(5):
    eps = torch.randn_like(std)
    z = mu + std * eps
    samples.append(z)

print('Five latent samples obtained through reparameterization:')
for sample in samples:
    print(sample)
Five latent samples obtained through reparameterization:
tensor([ 1.7705, -1.0734])
tensor([-0.0894, -0.8579])
tensor([ 0.4577, -1.3496])
tensor([ 1.2017, -0.7905])
tensor([ 0.6404, -1.1008])

Strengths and weaknesses of VAEs.

VAEs have several attractive properties:

  • a principled probabilistic interpretation;

  • an explicit encoder and decoder;

  • a latent space that is regularized and sampleable;

  • a meaningful relation to approximate Bayesian inference.

Their weaknesses are equally important:

  • generated samples and reconstructions can be too smooth;

  • the ELBO may not align perfectly with perceptual visual quality;

  • the chosen likelihood model can be too simplistic for complex textures.

For inverse problems, these strengths and weaknesses matter because a smooth latent prior may be mathematically convenient but may underrepresent fine imaging detail.

Adversarial Learning and GANs#

GANs take a strikingly different path. Instead of optimizing a tractable lower bound on the data likelihood, a GAN trains two networks in competition:

  • a generator \(G_{\boldsymbol{\Theta}}(\boldsymbol{z})\) that maps latent codes to images;

  • a discriminator \(D_\psi(\boldsymbol{x})\) that tries to distinguish real samples from generated ones.

The classical minimax problem is [6]

\[ \min_{\boldsymbol{\Theta}} \max_\psi \mathbb{E}_{\boldsymbol{x}\sim p_{\mathrm{data}}}[\log D_\psi(\boldsymbol{x})] + \mathbb{E}_{\boldsymbol{z}\sim p(\boldsymbol{z})}[\log(1-D_\psi(G_{\boldsymbol{\Theta}}(\boldsymbol{z})))]. \]

The generator tries to fool the discriminator, while the discriminator tries not to be fooled.

Warning

A sharp-looking generated image is not the same thing as a well-calibrated probabilistic model. This distinction is especially important when one later uses the model as a prior for inverse problems.

# Tiny 1D GAN-style training loop on a simple target distribution.
import torch

torch.manual_seed(0)

generator = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
discriminator = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-2)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-2)

for step in range(200):
    real = torch.randn(64, 1) * 0.2 + 1.5
    z = torch.randn(64, 1)
    fake = generator(z).detach()

    d_real = discriminator(real)
    d_fake = discriminator(fake)
    loss_d = (torch.nn.functional.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + torch.nn.functional.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    z = torch.randn(64, 1)
    fake = generator(z)
    d_fake = discriminator(fake)
    loss_g = torch.nn.functional.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    if step in [0, 49, 99, 199]:
        with torch.no_grad():
            samples = generator(torch.randn(256, 1))
            print(f'Step {step + 1:03d} | D loss = {loss_d.item():.4f} | G loss = {loss_g.item():.4f} | fake mean = {samples.mean().item():.4f}')
Step 001 | D loss = 1.4222 | G loss = 0.6463 | fake mean = 0.1051
Step 050 | D loss = 1.2525 | G loss = 1.2192 | fake mean = 2.9288
Step 100 | D loss = 1.3479 | G loss = 0.6396 | fake mean = 1.4250
Step 200 | D loss = 1.2644 | G loss = 1.2425 | fake mean = 2.3706

This is only a 1D adversarial game, but it helps students see the characteristic GAN behavior: two networks are trained against each other, and the generator is judged through the discriminator rather than through a pointwise reconstruction loss.

import torch

real_logits = torch.tensor([2.5, 1.8, 3.2])
fake_logits = torch.tensor([-1.2, -0.7, 0.1])

real_prob = torch.sigmoid(real_logits)
fake_prob = torch.sigmoid(fake_logits)

discriminator_loss = -(torch.log(real_prob).mean() + torch.log(1.0 - fake_prob).mean())
generator_loss = -torch.log(fake_prob).mean()

print('Discriminator probabilities on real samples:', real_prob)
print('Discriminator probabilities on fake samples:', fake_prob)
print('Toy discriminator loss:', float(discriminator_loss))
print('Toy generator loss:', float(generator_loss))
Discriminator probabilities on real samples: tensor([0.9241, 0.8581, 0.9608])
Discriminator probabilities on fake samples: tensor([0.2315, 0.3318, 0.5250])
Toy discriminator loss: 0.5608953237533569
Toy generator loss: 1.0702883005142212

Intuition behind adversarial training.

The key idea is that instead of comparing each generated sample to a specific target image, one asks whether generated images as a distribution are indistinguishable from real images. This is a radical conceptual shift.

Because GANs are not driven by pointwise reconstruction losses in the usual way, they can generate much sharper and more realistic fine detail than VAEs. This is one of the main reasons GANs attracted enormous interest.

Typical difficulties of GANs.

At the same time, GANs are notoriously delicate. Three issues are especially important in teaching.

Mode collapse.

The generator may map many latent vectors to very similar images, thereby covering only a limited portion of the data distribution.
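A simple way to make this visible is to measure the diversity of generated samples. The sketch below is purely illustrative: a randomly initialized linear map stands in for a healthy generator, and a hypothetical constant map stands in for a fully collapsed one; the average pairwise distance between samples then serves as a crude collapse diagnostic.

```python
import torch

torch.manual_seed(0)

healthy = torch.nn.Linear(4, 8)        # stand-in for a generator with diverse outputs
z = torch.randn(64, 4)

def collapsed(z):
    # A collapsed generator ignores its input: every latent code gives one image.
    return torch.ones(z.shape[0], 8)

def mean_pairwise_distance(samples):
    # Average Euclidean distance over all sample pairs; near zero signals collapse.
    return torch.pdist(samples).mean()

print('Healthy generator   | mean pairwise distance:', float(mean_pairwise_distance(healthy(z))))
print('Collapsed generator | mean pairwise distance:', float(mean_pairwise_distance(collapsed(z))))
```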

Training instability.

The optimization problem is not a simple minimization but a saddle-point game. This can produce oscillations, imbalance between generator and discriminator, and sensitivity to hyperparameters.

Lack of an explicit tractable likelihood.

GANs often generate excellent samples, but they do not naturally provide a convenient density model. This limits the direct probabilistic interpretation of the learned prior.

Comparison and Relevance for Inverse Problems#

These two models are pedagogically valuable precisely because they illustrate two contrasting philosophies of generative modeling.

The VAE is likelihood-oriented, variational, and probabilistically explicit.

The GAN is adversarial, game-theoretic, and focused on distributional realism rather than tractable density.

Seeing both helps students understand that “generative model” is not a single recipe. It is a family of approaches for representing complex data distributions.

Relation to inverse problems.

Now we come to the point that matters most for the course. Why are these models useful in imaging inverse problems?

The answer is that both VAEs and GANs define low-dimensional models of plausible images. If the reconstruction is constrained to lie in the range of a decoder or generator, then the search space is drastically reduced.

Suppose

\[ \boldsymbol{x} = G_{\boldsymbol{\Theta}}(\boldsymbol{z}). \]

Then instead of solving for \(\boldsymbol{x} \in \mathbb{R}^n\), one may solve for \(\boldsymbol{z} \in \mathbb{R}^k\):

\[ \widehat{\boldsymbol{z}} = \operatorname*{arg\,min}_{\boldsymbol{z}} \|KG_{\boldsymbol{\Theta}}(\boldsymbol{z})-\boldsymbol{y}^\delta\|^2+\lambda\|\boldsymbol{z}\|^2. \]

The recovered image is then

\[ \widehat{\boldsymbol{x}}=G_{\boldsymbol{\Theta}}(\widehat{\boldsymbol{z}}). \]

This is powerful because \(k \ll n\). However, it also introduces a bias: the true image must be well approximated by the range of the generator.
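A minimal sketch of this latent-space strategy follows. A frozen random linear map stands in for a trained generator \(G_{\boldsymbol{\Theta}}\), and a random matrix plays the role of the forward operator \(K\); both are assumptions chosen only so the example is self-contained and runnable.

```python
import torch

torch.manual_seed(0)

n, k, m = 64, 4, 24
G = torch.nn.Linear(k, n)                        # stand-in for a trained generator
for p in G.parameters():
    p.requires_grad_(False)                      # generator weights stay frozen

z_true = torch.randn(k)
x_true = G(z_true)                               # ground truth lies in the generator range
K = torch.randn(m, n) / m**0.5                   # toy forward operator
y = K @ x_true + 0.01 * torch.randn(m)           # noisy measurements y^delta

# Optimize over the low-dimensional latent code instead of the image itself.
z = torch.zeros(k, requires_grad=True)
optimizer = torch.optim.Adam([z], lr=1e-1)
lam = 1e-3
for step in range(300):
    residual = K @ G(z) - y
    loss = torch.sum(residual ** 2) + lam * torch.sum(z ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

x_hat = G(z).detach()
print('Relative reconstruction error:', float(torch.norm(x_hat - x_true) / torch.norm(x_true)))
```

Because the ground truth was placed in the range of the generator by construction, the recovery is accurate here; for a real image outside that range, the same procedure would return the closest representable image, which is precisely the bias discussed above.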

Summary#

This chapter should leave students with the following roadmap:

  • generative modeling aims to learn the image distribution, not only an input-output map;

  • VAEs build a probabilistic latent-variable model through the ELBO;

  • GANs learn realism through adversarial distribution matching;

  • VAEs are principled and stable but often smooth;

  • GANs are sharp and expressive but harder to train and analyze;

  • both are useful in inverse problems because they provide learned low-dimensional priors over images.

Exercises#

  1. Explain the role of the KL term in the VAE objective.

  2. Why can a VAE be sampled from more naturally than a plain autoencoder?

  3. What is mode collapse in a GAN, and why is it problematic?

  4. Discuss why latent generative models can be useful as priors for inverse problems.

Further Reading#

To deepen this material, students should compare the philosophical difference between VAEs and GANs as much as the technical difference. VAEs are tied to approximate likelihood-based inference, while GANs are tied to adversarial distribution matching. Keeping that contrast in mind is very helpful when these models are later used as priors for inverse problems.

A good study question is the following: when does one care more about a tractable probabilistic interpretation, and when does one care more about sample realism?