A brief overview of PyTorch#

In this chapter, we will briefly introduce PyTorch: arguably the most widely used library for developing Neural Network models in Python. In particular, we will focus on a few key components:

  • Tensors: Tensors are the building blocks of any PyTorch model. Since we will use them extensively, we need to learn at least their main properties and functionalities;

  • Data: Loading data in memory is a fundamental step in developing Neural Network-based models, and it requires special attention when the number of datapoints is large;

  • Model Design: A good model requires carefully optimizing its architecture (i.e. number of layers, number of neurons per layer, activation function, …). In this chapter we will learn how to build a simple MLP network, and we will come back to architecture design later in the course;

  • Training: Training a model (i.e. optimizing its parameters to achieve the task described by the dataset) requires setting up a few basic components. In this chapter, we will learn how to train a neural network model on a fairly simple dataset, with the default choices of each component.

Note

While we will try to cover all the basics of Neural Networks in this course, what is described in the following is far from a complete introduction to the topic. Please refer to the official PyTorch documentation or to any tutorial on YouTube for a more complete introduction.

What is PyTorch?#

PyTorch is an open-source deep learning framework developed by Facebook’s AI Research Lab (FAIR). It provides tensor computation, automatic differentiation, and deep learning model building capabilities with a user-friendly and Pythonic interface.

PyTorch is widely used in both research and industry due to its flexibility, ease of debugging, and strong community support. It enables researchers and developers to quickly prototype and train neural networks using GPUs for acceleration.

Key Features of PyTorch#

  • Dynamic Computational Graphs: Unlike TensorFlow 1.x, which relied on static graphs, PyTorch dynamically builds computational graphs, making it easier to debug and modify models.

  • Automatic Differentiation (Autograd): PyTorch automatically computes gradients, making it seamless to implement backpropagation for neural networks (a topic that will be explained in more depth later).

  • GPU Acceleration: PyTorch seamlessly integrates with CUDA and MPS (on Apple Silicon) for fast GPU computing.

  • Strong Ecosystem: Includes tools like torchvision for images, torchtext for NLP, and torchaudio for speech processing.

PyTorch vs. TensorFlow#

PyTorch and TensorFlow are the two most popular deep learning frameworks. Here’s a comparison of their strengths and weaknesses:

| Feature | PyTorch | TensorFlow |
| --- | --- | --- |
| Ease of Use | Intuitive, Pythonic | More complex, requires more boilerplate |
| Dynamic Graphs | ✅ Yes | 🚫 No (TF 1.x), ✅ Yes (TF 2.x) |
| Debugging | Easier (native Python debugging tools) | More difficult (static graphs in TF 1.x) |
| Performance | Excellent for research and fast prototyping | Optimized for large-scale deployment |
| Ecosystem | Torchvision, TorchText, TorchAudio | TensorFlow Hub, TF-Agents (RL), TensorFlow.js |
| Industry Adoption | Preferred in research | Preferred in large-scale industry applications |
| Community Support | Strong in academia and research | Larger enterprise-level adoption |

Which One Should You Choose?#

During your master's degree, you will come across both of the frameworks described above. In particular, the Deep Learning course taught by Professor Andrea Asperti uses TensorFlow, while we will use PyTorch in this course. As a general guideline, you should:

  • Choose PyTorch if:

    • You prioritize ease of use and fast prototyping.

    • You work in research or academia.

    • You need dynamic graphs for flexible model structures.

  • Choose TensorFlow if:

    • You want better production-ready tools for deployment.

    • You are working in enterprise applications with large-scale models.

Installation#

PyTorch provides an easy installation process. You can install it using pip (for Python users) or conda (for Anaconda users). It is sufficient to select your system preferences from the menu on the official website, https://pytorch.org/get-started/locally/, and copy-paste the generated command.

Note

I recommend always using pip to install PyTorch, as it usually causes fewer issues.

Once installed, verify the installation by running the following command in Python:

import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("CUDA Available:", torch.cuda.is_available())
CUDA Available: False

If torch is installed correctly, the import succeeds and the script confirms whether CUDA is available.

Pytorch Tensors#

At the core of PyTorch is the tensor, a multi-dimensional array similar to NumPy arrays but with additional capabilities, such as GPU acceleration and automatic differentiation.

PyTorch provides multiple ways to create tensors, most of which follow the same syntax as NumPy:

import torch

# Creating a tensor from a list
t1 = torch.tensor([1, 2, 3])
print(t1)

# Creating a tensor with predefined values
t2 = torch.zeros(3, 3)  # 3x3 matrix of zeros
t3 = torch.ones(2, 4)   # 2x4 matrix of ones
t4 = torch.rand(2, 2)   # 2x2 matrix of random values between 0 and 1

print(t2)
print(t3)
print(t4)
tensor([1, 2, 3])
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.]])
tensor([[0.5695, 0.9719],
        [0.1955, 0.0354]])

Tensor Properties#

Each torch tensor has several key attributes:

t = torch.rand(3, 4)

print(f"Shape: {t.shape}")  # Dimensions of the tensor
print(f"Data type: {t.dtype}")  # Data type (default is float32)
print(f"Device: {t.device}")  # CPU or GPU
Shape: torch.Size([3, 4])
Data type: torch.float32
Device: cpu
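Attributes like the dtype and the device can also be changed after a tensor is created. Below is a minimal sketch (the CUDA branch only runs if a GPU is available):

# Change the data type of a tensor
t_double = t.to(torch.float64)   # same values, stored as float64
t_int = t.int()                  # cast to 32-bit integers
print(t_double.dtype, t_int.dtype)

# Move the tensor to the GPU only if one is available
if torch.cuda.is_available():
    t = t.to("cuda")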

Basic Tensor Operations#

Tensors support element-wise operations, matrix multiplications, and reshaping.

x = torch.tensor([[1, 2], [3, 4]])
y = torch.tensor([[5, 6], [7, 8]])

# Element-wise operations
print(x + y)  # Addition
print(x * y)  # Multiplication
print(torch.sqrt(x.float()))  # Square root (requires float type)

# Matrix multiplication
print(x @ y)  # Equivalent to torch.matmul(x, y)

# Reshaping tensors
z = torch.arange(6).reshape(2, 3)
print(z)
tensor([[ 6,  8],
        [10, 12]])
tensor([[ 5, 12],
        [21, 32]])
tensor([[1.0000, 1.4142],
        [1.7321, 2.0000]])
tensor([[19, 22],
        [43, 50]])
tensor([[0, 1, 2],
        [3, 4, 5]])

Moving Tensors to GPU#

If a GPU is available, we can move tensors to it for faster computation.

if torch.cuda.is_available():
    device = torch.device("cuda")  # Use GPU
    x = x.to(device)
    print(f"Tensor is now on: {x.device}")
else:
    print("CUDA is not available. Running on CPU.")
CUDA is not available. Running on CPU.

Data Loading#

In deep learning, we often work with large datasets that cannot fit into memory all at once. torch provides efficient tools to handle data loading through the Dataset and DataLoader classes.

The Dataset Class#

Pytorch’s torch.utils.data.Dataset is an abstract class that must be subclassed to define custom datasets. A Dataset object should implement three methods:

  • __init__: Initializes the dataset (e.g., loads file paths, applies transformations).

  • __len__: Returns the total number of samples in the dataset.

  • __getitem__: Retrieves a single sample by index.

Creating a Custom Dataset: Mathematically, a dataset is a sequence of pairs \(\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)}) \}\), where each \(x^{(i)}\) is a \(d\)-dimensional vector, while \(y^{(i)}\) is an \(s\)-dimensional vector.

To create a dataset in torch, we need to build a dataset class so that, when it is indexed with index i, it returns the pair \((x^{(i)}, y^{(i)})\) as a tuple of tensors with shapes (d, ) and (s, ), called the input and output shape, respectively.

Sometimes (when the dimensionality of the data allows it), all the datapoints get stacked together into two large tensors X and Y, which thus have shapes (N, d) and (N, s). Clearly, X[i, :] corresponds to the input tensor \(x^{(i)}\), and Y[i, :] corresponds to the output tensor \(y^{(i)}\).

Consider, as an example, the dataset \(D = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)}) \}\), such that the \(x^{(i)}\) are uniformly distributed datapoints in the range \([-2, 2]\), while \(y^{(i)} = 2 x^{(i)} + 3\). In the following, we will avoid building X and Y explicitly, relying instead on a dataset class whose __getitem__ method returns the pair \((x^{(i)}, y^{(i)})\) upon request.

import torch
from torch.utils.data import Dataset

class SimpleDataset(Dataset):
    def __init__(self, N=100):
        self.x = torch.linspace(-2, 2, N)
        self.y = 2 * self.x + 3  # Linear function

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# Create dataset instance
dataset = SimpleDataset(N=200)

# Fetch a single data point
idx = 10
x_sample, y_sample = dataset[idx]
print(f"x: {x_sample}, y: {y_sample}")
x: -1.798995018005371, y: -0.5979900360107422

The DataLoader Class#

We already observed that when the dataset is too large, it cannot be loaded into memory as a whole (especially when working with a GPU, which usually has less dedicated memory than the system). On the other hand, processing one sample at a time is usually inefficient, as it would take much longer to go through the whole dataset.

For this reason, when working with basically any Machine Learning algorithm (and in particular with neural networks), it is common to work with mini-batches. A mini-batch is a small subset of the dataset, built by stacking together multiple datapoints randomly drawn from the dataset, so that it occupies a limited amount of memory. Usually, in PyTorch, a mini-batch (often simply called a batch) is represented as a pair of tensors (x, y) with shapes (b, d) and (b, s), respectively, where the first dimension is the batch axis and its number of elements b is called the batch_size.

In PyTorch, the object responsible for randomly sampling a given number of datapoints from a Dataset and assembling them into a batch is called a DataLoader.

from torch.utils.data import DataLoader

# Create a DataLoader
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Iterate through batches
for batch in dataloader:
    x_batch, y_batch = batch
    print(f"Batch - x: {x_batch}, y: {y_batch}")

    break # Only show the first batch to keep the output short
Batch - x: tensor([1.4975, 1.5377]), y: tensor([5.9950, 6.0754])

Key Parameters of DataLoader:

  • batch_size: Number of samples per batch.

  • shuffle: Whether to shuffle the data at the beginning of each epoch.

  • num_workers: Number of subprocesses to use for data loading (useful for large datasets).

  • drop_last: Whether to drop the last incomplete batch if dataset size isn’t divisible by batch size.
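As a sketch of how these options fit together, a DataLoader for the toy dataset above could be configured as follows (the specific values are just illustrative):

from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,            # the SimpleDataset instance defined above
    batch_size=32,      # 32 samples per batch
    shuffle=True,       # reshuffle the data at every epoch
    num_workers=2,      # use 2 subprocesses for data loading
    drop_last=True,     # drop the last incomplete batch
)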

An example: the California Housing Dataset#

Let’s see an example of how to load a built-in dataset using sklearn.datasets. In particular, we will use the California Housing dataset, a regression dataset where the goal is to predict house prices based on features such as median income, number of rooms, and population in an area. This dataset contains 8 numerical features (e.g., median income, total rooms, housing age, etc.) and one target variable (median house value in $100,000s).

import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Load dataset from sklearn
data = fetch_california_housing()
X, y = data.data, data.target

Preprocessing the Data#

Since neural networks work best with normalized inputs, we standardize the features using StandardScaler.

# Standardize features for better training stability
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

y = y.reshape(-1, 1)  # Reshape target to be a column vector

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

Creating a Custom PyTorch Dataset#

We define a custom dataset by subclassing torch.utils.data.Dataset.

class CaliforniaHousingDataset(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create Dataset instances
train_dataset = CaliforniaHousingDataset(X_train_tensor, y_train_tensor)
test_dataset = CaliforniaHousingDataset(X_test_tensor, y_test_tensor)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

Next, we’ll define our first neural network model for regression using this dataset.

Defining Our First Model in PyTorch#

Now that we understand how to load data, let’s build a simple fully connected (dense) neural network using torch.nn.Module. We’ll start with a basic model and then improve it step by step.

Defining a Simple Neural Network#

PyTorch models are created by subclassing torch.nn.Module. The key components are:

  • __init__: Defines the layers.

  • forward: Defines how data flows through the model.

Let’s create a simple Multi-Layer Perceptron (MLP) with one hidden layer:

import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)  # First fully connected layer
        self.fc2 = nn.Linear(hidden_size, output_size) # Output layer

    def forward(self, x):
        x = nn.ReLU()(self.fc1(x))  # Apply ReLU activation to first layer
        x = self.fc2(x)          # Output layer (no activation for now)
        return x

# Create an instance of the model
model = SimpleNN(input_size=8, 
                 hidden_size=64, 
                 output_size=1)
print(model)
SimpleNN(
  (fc1): Linear(in_features=8, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=1, bias=True)
)

Explanation:

  • nn.Linear(input_size, hidden_size): Fully connected layer transforming the input.

  • nn.ReLU()(x): Applies a non-linear activation function (ReLU activation).

  • forward(x): Defines how the input data is processed.

  • No activation on the last layer: Typically, output activations depend on the task (e.g., sigmoid for binary classification, softmax for multi-class classification).
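To illustrate the last point, here is a hypothetical binary-classification variant of the same architecture, ending with a sigmoid activation (this is only a sketch, not the model used in this chapter):

import torch
import torch.nn as nn

class SimpleBinaryClassifier(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = nn.ReLU()(self.fc1(x))
        return torch.sigmoid(self.fc2(x))  # output in (0, 1), interpreted as a probability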

Passing Data Through the Model#

Given the model and the dataset, we can check its prediction over a random batch of datapoints (given by the DataLoader).

# Sample data from the dataset
x_batch, y_batch = next(iter(train_loader))

# Check the shape of the batch
print(f"Shape of x_batch: {x_batch.shape}. Shape of y_batch: {y_batch.shape}")

# Forward pass through the model
y_prediction = model(x_batch)

# Visualizing a value compared to the real (expected) solution
print(f"Real value: {y_batch[0].item()}. Model prediction: {y_prediction[0].item()}.")
Shape of x_batch: torch.Size([16, 8]). Shape of y_batch: torch.Size([16, 1])
Real value: 2.0239999294281006. Model prediction: 0.16520507633686066.

Training a Model#

You probably noticed that the model prediction is completely different from the real value of the target variable. This happens because the model has not been trained yet. As already remarked, training a model is the process of iteratively updating its parameters \(\Theta\) so that its predictions match the training data.

A neural network model is usually trained by a variant of the Stochastic Gradient Descent (SGD) algorithm: the stochastic version of the Gradient Descent optimization algorithm. In particular, given an initial value for the model parameters \(\Theta_0\), a loss function \(\ell: \mathbb{R}^s \times \mathbb{R}^s \to \mathbb{R}_+\), and a training dataset \(D\), the SGD algorithm iteratively updates the parameters with the following procedure:

  • Sample a batch \((x_b, y_b)\) from \(D\).

  • Compute the gradient \(g_k = \nabla_{\Theta} \ell(f_{\Theta_k}(x_b), y_b)\).

  • Update \(\Theta_{k+1} = \Theta_k - \nu g_k\), where \(\nu > 0\) is the learning rate.

At this point, we already discussed how to create and sample a batch of data from \(D\). The next step we need to learn is how to compute \(g_k\), and here is where Pytorch becomes really useful.

Automatic Differentiation#

PyTorch tensors differ from NumPy arrays mainly in that they can keep track of each operation leading from a leaf tensor (i.e. a freshly created tensor) to the present tensor. This behaviour is controlled by the requires_grad property of the tensor (which is False by default for plain tensors, and True for model parameters).

When a leaf tensor is declared with requires_grad = True, each operation involving it gets recorded. This way, it is possible to automatically compute the gradient of any function with respect to the leaf tensor by backpropagating from the output to the input through the computational graph, using the chain rule to combine the derivatives at each step.

Indeed, we recall that if \(g: \mathbb{R}^n \to \mathbb{R}^n\) is a function mapping a leaf tensor \(x\) to an intermediate value \(z = g(x)\), and \(f: \mathbb{R}^n \to \mathbb{R}\) is a scalar function (such as a loss function) mapping \(z\) to an output \(y = f(z)\), then the gradient of \(f(g(x))\) with respect to \(x\) can be easily computed as:

\[ \nabla_x f(g(x)) = J_g(x)^\top \nabla_z f(z), \]

where \(J_g(x)\) denotes the Jacobian matrix of \(g\) evaluated at \(x\).

This process is performed automatically in PyTorch by calling the .backward() method on the final (scalar) tensor, typically the loss. The gradient with respect to \(x\) can then be accessed via x.grad. For example:

import torch

# Create a leaf tensor
x = torch.linspace(0, 1, 20, requires_grad=True)

# Compute y = x**2
y = torch.square(x)

# Compute loss = sum(x**2)
loss = torch.sum(y)

# Compute gradient of the loss
loss.backward()

# Extract gradient wrt x -> d/dx sum(x**2) = 2*x
g = x.grad
print(g)
tensor([0.0000, 0.1053, 0.2105, 0.3158, 0.4211, 0.5263, 0.6316, 0.7368, 0.8421,
        0.9474, 1.0526, 1.1579, 1.2632, 1.3684, 1.4737, 1.5789, 1.6842, 1.7895,
        1.8947, 2.0000])
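Once the gradient is available, a single gradient descent step can already be performed by hand. Below is a minimal sketch on a toy one-parameter model (the data and the learning rate are made up for illustration); in practice, this update is delegated to an optimizer, as shown in the next section.

import torch

# Toy batch: y = 2*x, and a single trainable parameter theta
x_b = torch.tensor([1.0, 2.0, 3.0])
y_b = torch.tensor([2.0, 4.0, 6.0])
theta = torch.tensor(0.0, requires_grad=True)

# Compute the loss and its gradient g_k = d(loss)/d(theta)
loss = torch.mean((theta * x_b - y_b) ** 2)
loss.backward()

# Gradient descent step: theta_{k+1} = theta_k - nu * g_k
with torch.no_grad():
    theta -= 0.1 * theta.grad
theta.grad.zero_()  # reset the stored gradient before the next step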

Training a neural network#

This process can be exploited to run the Stochastic Gradient Descent (SGD) algorithm and train the neural network on our train loader.

To do that, we should initialize an optimizer, which holds a reference to the model parameters \(\Theta\): when the .backward() method is called on the loss, the gradients with respect to the parameters get stored in their .grad attributes, and calling the optimizer step applies the gradient descent update to the parameters. This is done as follows:

# Define loss function (for example, MSE)
loss_fn = nn.MSELoss()

# Define optimizer (feeding the model parameters into it)
# Adam -> variant of SGD algorithm commonly used nowadays
#   lr -> "learning rate"
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-4)

# Set other parameters (e.g. the number of epochs: number of times the training loop is repeated)
n_epochs = 50

# Epoch cycle
for epoch in range(n_epochs):
    avg_loss = 0.0

    # Training loop
    for k, data in enumerate(train_loader):
        # Get x, y from data
        x, y = data

        # Compute neural network prediction
        y_pred = model(x)

        # Compare y_pred with the real y
        loss = loss_fn(y_pred, y)

        # Compute gradient
        loss.backward()

        # Update model weights
        optimizer.step()
        optimizer.zero_grad() # Reset the gradients to zero: IMPORTANT

        # Accumulate the loss and print its running average
        avg_loss += loss.item()
        # Commented out to keep the page output short
        # print(f"Epoch: {epoch}. Avg Loss: {avg_loss / (k+1):0.4f}", end="\r")
    # print()

# Saving the model after the cycle
torch.save(model.state_dict(), "path-for-model.pth")
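For completeness, the saved weights can later be restored into a model with the same architecture; a minimal sketch, reusing the SimpleNN class and the file name from above:

# Re-create the architecture and load the saved weights
model_reloaded = SimpleNN(input_size=8, hidden_size=64, output_size=1)
model_reloaded.load_state_dict(torch.load("path-for-model.pth"))
model_reloaded.eval()  # set the model to evaluation mode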

Testing the trained model#

Now that we have optimized the neural network parameters, we are ready to check whether the prediction of the network on new data is good or not. To do that, we can simply load a batch from the test set, compute the prediction on it, and check whether it matches the real value.

To save memory, this operation can be done without tracking the gradients, as we won't use them to update the model weights. This is done by wrapping the operation inside a with torch.no_grad() block.

# Disable gradient memorization
with torch.no_grad():
    # Sample a batch from the test set
    x_batch, y_batch = next(iter(test_loader))

    # Forward pass through the model
    y_prediction = model(x_batch)

    print(f"Prediction: {y_prediction[0].item():0.4f}. True: {y_batch[0].item():0.4f}.")
Prediction: 1.2939. True: 1.6290.

Now it is definitely better! Clearly, the prediction can be further improved by carefully tuning all the components we set up to this point (architecture, learning rate, number of epochs, …). However, this is out of the scope of this course.
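For a more quantitative check than a single sample, one can average the loss over the whole test set; a minimal sketch, reusing the objects defined above:

# Average MSE over the entire test set (no gradients needed)
with torch.no_grad():
    total_loss = 0.0
    for x, y in test_loader:
        total_loss += loss_fn(model(x), y).item() * x.shape[0]
    print(f"Test MSE: {total_loss / len(test_dataset):0.4f}")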

We can now move to the next chapter, where we will learn how to actually reconstruct images with neural networks.