Stochastic Gradient Descent

When working in Machine Learning (ML), it is common to have a dataset \((X, Y) = \{ (x^i, y^i) \}_{i=1}^N\) and a parametric function \(f_\theta(x)\) whose specific shape depends on the task. As already mentioned, training a Machine Learning model is basically an optimization problem: we need to find parameters \(\theta\) (known as weights) such that \(f_\theta(x^i) \approx y^i\) for any \(i = 1, \dots, N\). To do that, we usually consider a loss function, which in this case depends on the weights \(\theta\) and the dataset \((X, Y)\). We will denote it by \(\ell(\theta; X, Y)\).

In most cases, \(\ell(\theta; X, Y)\) can be written as a sum of simpler terms, each depending on a single datapoint, i.e.

\[ \ell(\theta; X, Y) = \sum_{i=1}^N \ell_i(\theta; x^i, y^i) \]

and the training optimization problem becomes

\[ \theta^* = \arg\min_\theta \ell(\theta; X, Y) = \arg\min_\theta \sum_{i=1}^N \ell_i(\theta; x^i, y^i) \]

which can be solved by the Gradient Descent (GD) algorithm:

\[\begin{split} \begin{cases} \theta_0 \in \mathbb{R}^d \\ \theta_{k+1} = \theta_k - \alpha_k \nabla_\theta \ell(\theta_k; X, Y) = \theta_k - \alpha_k \sum_{i=1}^N \nabla_\theta \ell_i(\theta_k; x^i, y^i) \end{cases} \end{split}\]

where we used the linearity of the gradient, i.e. \(\nabla_\theta \ell(\theta_k; X, Y) = \nabla_\theta \sum_{i=1}^N \ell_i(\theta_k; x^i, y^i) = \sum_{i=1}^N \nabla_\theta \ell_i(\theta_k; x^i, y^i)\).

Therefore, each iteration of Gradient Descent requires the gradient of the objective function with respect to the weights, which is obtained by summing the gradients of the individual terms \(\ell_i(\theta; x^i, y^i)\).
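For concreteness, below is a minimal sketch of the GD iteration above, with a constant step size \(\alpha\) for simplicity; the function grad_loss is an assumed user-supplied routine returning \(\nabla_\theta \ell(\theta; X, Y)\), already summed over the datapoints.

import numpy as np

def GD(grad_loss, theta0, X, Y, alpha, n_iter):
    # Plain Gradient Descent: theta_{k+1} = theta_k - alpha * grad l(theta_k; X, Y)
    theta = theta0
    for _ in range(n_iter):
        theta = theta - alpha * grad_loss(theta, X, Y)  # full-gradient step
    return theta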

As an example, a common loss function in Machine Learning is the Mean Squared Error (MSE), defined by

\[ \ell(\theta; X, Y) := MSE(f_\theta(X), Y) = \frac{1}{N} \sum_{i=1}^N (f_\theta(x^i) - y^i)^2 = \sum_{i=1}^N \underbrace{\frac{1}{N} (f_\theta(x^i) - y^i)^2}_{=: \ell_i(\theta; x^i, y^i)}. \]

Computing \(\nabla_\theta MSE(f_\theta(X), Y)\) is not hard, since

\[ \nabla_\theta MSE(f_\theta(X), Y) = \nabla_\theta \sum_{i=1}^N \frac{1}{N} (f_\theta(x^i) - y^i)^2 = \frac{1}{N} \sum_{i=1}^N \nabla_\theta (f_\theta(x^i) - y^i)^2, \]

and, by applying the chain rule,

\[ \nabla_\theta (f_\theta(x^i) - y^i)^2 = 2 (f_\theta(x^i) - y^i) \nabla_\theta f_\theta(x^i). \]

When \(\nabla_\theta f_\theta(x^i)\) can be computed explicitly (this depends on the shape of \(f_\theta\)), the gradient descent iteration to solve the training optimization problem can be implemented as

\[ \theta_{k+1} = \theta_k - \alpha_k \underbrace{\frac{2}{N} \sum_{i=1}^N (f_{\theta_k}(x^i) - y^i) \nabla_\theta f_{\theta_k}(x^i)}_{\nabla_\theta \ell(\theta_k; X, Y)}. \]
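To make this iteration concrete, here is a minimal sketch assuming a linear model \(f_\theta(x) = \theta^T x\), so that \(\nabla_\theta f_\theta(x^i) = x^i\); the model choice and the synthetic data are illustrative assumptions, not prescribed by the text.

import numpy as np

def mse_gd_step(theta, X, Y, alpha):
    # One GD iteration for the MSE loss with a linear model f_theta(x) = X @ theta,
    # so that grad f_theta(x^i) = x^i. X has shape (N, d), Y has shape (N, ).
    residual = X @ theta - Y                  # f_theta(x^i) - y^i, shape (N, )
    grad = (2 / X.shape[0]) * X.T @ residual  # (2/N) sum_i residual_i * x^i
    return theta - alpha * grad

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
theta = np.zeros(3)
for _ in range(200):
    theta = mse_gd_step(theta, X, Y, alpha=0.1)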

Stochastic Gradient Descent (SGD) algorithm

Unfortunately, even if it is easy to compute the gradient of \(\ell_i(\theta; x^i, y^i)\) for any \(i\), when the number of samples \(N\) is large (which is common in Machine Learning), computing the full gradient \(\nabla_\theta \ell(\theta; X, Y)\) is prohibitive, mostly because of memory limitations. For this reason, in such optimization problems, instead of the standard GD algorithm it is better to use the Stochastic Gradient Descent (SGD) method. This is a variant of classical GD where, instead of computing \(\nabla_\theta \ell(\theta; X, Y) = \sum_{i=1}^N \nabla_\theta \ell_i(\theta; x^i, y^i)\), the summation is reduced to a limited number of terms, called a batch. The idea is the following:

  • Given a number \(N_{batch} \ll N\) (usually called the batch_size), randomly extract a subset \(\mathcal{M}\) with \(|\mathcal{M}| = N_{batch}\) from \((X, Y)\). This set will be called a batch;

  • Approximate the true gradient

\[ \nabla_\theta \ell(\theta; X, Y) = \sum_{i=1}^N \nabla_\theta \ell_i(\theta; x^i, y^i) \]

with the batch gradient

\[ \nabla_\theta \ell(\theta; \mathcal{M}) = \sum_{i\in \mathcal{M}} \nabla_\theta \ell_i(\theta; x^i, y^i); \]
  • Compute a single iteration of the GD algorithm using the batch gradient

\[ \theta_{k+1} = \theta_k - \alpha_k \nabla_\theta \ell(\theta_k; \mathcal{M}); \]
  • Repeat until the whole dataset has been extracted. Notice that the random sampling at each iteration is done without replacement.

Each iteration of the algorithm above is usually called a batch iteration. When the whole dataset has been processed, we say that we have completed an epoch of the SGD method. This algorithm should be repeated for a fixed number \(E\) of epochs to reach convergence.

Below is a Python implementation of the SGD algorithm.

import numpy as np

def SGD(loss, grad_loss, D, theta0, alpha, batch_size, n_epochs):
    X, y = D  # Unpack the data
    N = X.shape[0]  # Number of datapoints (X and y have N rows)
    d = theta0.shape[0]  # Number of parameters, i.e. theta0 has shape (d, )
    idx = np.arange(0, N)  # Index vector, required for the shuffling
    
    # Initialization of history vectors
    theta_history = np.zeros((n_epochs, d))  # Save parameters at each epoch
    loss_history = np.zeros((n_epochs, ))  # Save loss values at each epoch
    grad_norm_history = np.zeros((n_epochs, ))  # Save gradient norms at each epoch
    
    # Initialize weights
    theta = theta0
    for epoch in range(n_epochs):
        # Shuffle the data at the beginning of each epoch
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]

        # Collect the norm of the gradient at each batch iteration of this epoch
        grad_loss_vec = []

        for batch_start in range(0, N, batch_size):
            batch_end = min(batch_start + batch_size, N)
            X_batch = X[batch_start:batch_end]
            y_batch = y[batch_start:batch_end]
            
            # Compute the gradient of the loss
            gradient = grad_loss(theta, X_batch, y_batch)
            grad_loss_vec.append(np.linalg.norm(gradient, 2))

            # Update weights
            theta = theta - alpha * gradient

        # Save the updated values
        theta_history[epoch] = theta
        loss_history[epoch] = loss(theta, X, y)
        grad_norm_history[epoch] = np.mean(grad_loss_vec)
    
    return theta_history, loss_history, grad_norm_history

Exercise: Test the SGD algorithm to train a polynomial regression model on the poly_regression_large.csv data. Try different values for the polynomial degree.
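As a starting point, here is a hedged sketch of how the SGD function above might be called for polynomial regression; the synthetic data below is only a stand-in for poly_regression_large.csv, and the Vandermonde-based model and MSE loss are illustrative assumptions.

import numpy as np

# Polynomial model: f_theta(x) = sum_j theta_j * x^j, built via a Vandermonde matrix
def vandermonde(x, degree):
    return np.vander(x, degree + 1, increasing=True)  # shape (N, degree + 1)

def loss(theta, x, y):
    # MSE loss of the polynomial model on the given data
    return np.mean((vandermonde(x, len(theta) - 1) @ theta - y) ** 2)

def grad_loss(theta, x, y):
    # Gradient of the MSE: (2/N) * V^T (V theta - y)
    V = vandermonde(x, len(theta) - 1)
    return (2 / x.shape[0]) * V.T @ (V @ theta - y)

# Synthetic stand-in for the data in poly_regression_large.csv
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=1000)
y = 1 - 2 * x + 3 * x**2 + 0.1 * rng.normal(size=1000)

degree = 2  # Try different values for the polynomial degree
theta0 = np.zeros(degree + 1)
theta_history, loss_history, grad_norm_history = SGD(
    loss, grad_loss, (x, y), theta0, alpha=0.1, batch_size=32, n_epochs=50
)
print(theta_history[-1], loss_history[-1])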

Warning

SGD has some drawbacks compared to GD. In particular, there is no way to check whether it has reached convergence (since we obviously cannot compute the full gradient \(\ell(\theta; X, Y)\) to check its distance from zero, as required by the first stopping criterion), and we cannot use the backtracking algorithm, for the same reason. As a consequence, the algorithm will stop ONLY after reaching the fixed number of epochs, and we must set a good value for the step size \(\alpha_k\) by hand. These problems are partially solved by more recent algorithms such as SGD with Momentum, Adam, AdaGrad, … whose study is beyond the scope of the course.