Stochastic Gradient Descent#
When working with Machine Learning (ML) it is common to have a dataset \((X, Y) = \{ (x^i, y^i) \}_{i=1}^N\) and a parametric function \(f_\theta(x)\) whose specific shape depends on the task. As already mentioned, training a Machine Learning model is basically an optimization problem, where we need to find parameters \(\theta\) (known as weights) such that \(f_\theta(x^i) \approx y^i\) for any \(i = 1, \dots, N\). To do that, we usually consider a loss function, which in this case depends on the weights \(\theta\) and on the dataset \((X, Y)\). We will indicate it as \(\ell(\theta; X, Y)\).
In most cases, \(\ell(\theta; X, Y)\) can be written as a sum of simpler components, each depending on a single datapoint, i.e.

\[
\ell(\theta; X, Y) = \sum_{i=1}^N \ell_i(\theta; x^i, y^i),
\]

and the training optimization problem becomes

\[
\theta^* = \arg\min_\theta \sum_{i=1}^N \ell_i(\theta; x^i, y^i),
\]
which can be solved by the Gradient Descent (GD) algorithm, as

\[
\theta_{k+1} = \theta_k - \alpha_k \sum_{i=1}^N \nabla_\theta \ell_i(\theta_k; x^i, y^i),
\]

where we used that \(\nabla_\theta \ell(\theta_k; X, Y) = \nabla_\theta \sum_{i=1}^N \ell_i(\theta_k; x^i, y^i) = \sum_{i=1}^N \nabla_\theta \ell_i(\theta_k; x^i, y^i)\).
Therefore, each iteration of Gradient Descent requires the gradient of the objective function with respect to the weights, which is obtained by summing up the gradients of the individual functions \(\ell_i(\theta; x^i, y^i)\).
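To make this decomposition concrete, below is a minimal numerical check (a toy example: the linear model and the squared-error terms are illustrative assumptions, not fixed by the text above) showing that the gradient of the total loss is exactly the sum of the per-sample gradients.

import numpy as np

# Toy data: N samples of dimension d (illustrative values)
rng = np.random.default_rng(0)
N, d = 5, 3
X = rng.normal(size=(N, d))
Y = rng.normal(size=(N, ))
theta = rng.normal(size=(d, ))

# Per-sample gradient of l_i(theta) = (theta^T x^i - y^i)^2
def grad_l_i(theta, x_i, y_i):
    return 2 * (x_i @ theta - y_i) * x_i

# Summing the per-sample gradients...
grad_sum = sum(grad_l_i(theta, X[i], Y[i]) for i in range(N))

# ...matches the gradient of the full loss, computed in one shot
grad_full = 2 * X.T @ (X @ theta - Y)

print(np.allclose(grad_sum, grad_full))  # True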
As an example, a common loss function in Machine Learning is the Mean Squared Error (MSE), defined by

\[
MSE(f_\theta(X), Y) = \frac{1}{N} \sum_{i=1}^N \bigl( f_\theta(x^i) - y^i \bigr)^2.
\]

Computing \(\nabla_\theta MSE(f_\theta(X), Y)\) is not hard, since

\[
\nabla_\theta MSE(f_\theta(X), Y) = \frac{1}{N} \sum_{i=1}^N \nabla_\theta \bigl( f_\theta(x^i) - y^i \bigr)^2,
\]

but

\[
\nabla_\theta \bigl( f_\theta(x^i) - y^i \bigr)^2 = 2 \bigl( f_\theta(x^i) - y^i \bigr) \nabla_\theta f_\theta(x^i)
\]

by applying the chain rule. When \(\nabla_\theta f_\theta(x^i)\) can be explicitly computed (it depends on the shape of \(f_\theta\)), the gradient descent iteration to solve the training optimization problem can be implemented as

\[
\theta_{k+1} = \theta_k - \frac{2 \alpha_k}{N} \sum_{i=1}^N \bigl( f_\theta(x^i) - y^i \bigr) \nabla_\theta f_\theta(x^i).
\]
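When \(f_\theta\) is, for instance, a linear model \(f_\theta(x) = \theta^\top x\) (so that \(\nabla_\theta f_\theta(x) = x\)), the iteration above takes a few lines of NumPy. This is a minimal sketch: the data and the step size below are made-up illustrative choices.

import numpy as np

# Synthetic data from a known linear model (illustrative assumption)
rng = np.random.default_rng(0)
N, d = 100, 2
X = rng.normal(size=(N, d))
theta_true = np.array([1.5, -0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=(N, ))

alpha = 0.01           # step size
theta = np.zeros(d)    # initial guess
for k in range(1000):
    residual = X @ theta - Y               # f_theta(x^i) - y^i, for all i
    gradient = (2 / N) * X.T @ residual    # chain-rule formula above
    theta = theta - alpha * gradient

print(theta)  # approximately theta_true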
Stochastic Gradient Descent (SGD) algorithm#
Unfortunately, even if it is easy to compute the gradient of \(\ell_i(\theta; x^i, y^i)\) for any \(i\), when the number of samples \(N\) is large (which is common in Machine Learning) the computation of the full gradient \(\nabla_\theta \ell(\theta; X, Y)\) is prohibitive, mostly because of memory limitations. For this reason, in such optimization problems it is better to use the Stochastic Gradient Descent (SGD) method instead of the standard GD algorithm. SGD is a variant of classical GD where, instead of computing \(\nabla_\theta \ell(\theta; X, Y) = \sum_{i=1}^N \nabla_\theta \ell_i(\theta; x^i, y^i)\), the summation is reduced to a limited number of terms, called a batch. The idea is the following:
1. Given a number \(N_{batch} \ll N\) (usually called `batch_size`), randomly extract a sub-dataset \(\mathcal{M}\) with \(|\mathcal{M}| = N_{batch}\) from \((X, Y)\). This set will be called a batch.

2. Approximate the true gradient as

   \[
   \nabla_\theta \ell(\theta; X, Y) \approx \sum_{i \in \mathcal{M}} \nabla_\theta \ell_i(\theta; x^i, y^i),
   \]

   where the summation now runs over the \(N_{batch}\) elements of the batch only.

3. Compute one single iteration of the GD algorithm

   \[
   \theta_{k+1} = \theta_k - \alpha_k \sum_{i \in \mathcal{M}} \nabla_\theta \ell_i(\theta_k; x^i, y^i).
   \]

4. Repeat until the full dataset has been extracted. Notice that the random sampling at each iteration is done without replacement.
Each iteration of the algorithm above is usually called a batch iteration. When the whole dataset has been processed, we say that we have completed an epoch of the SGD method. The algorithm is repeated for a fixed number \(E\) of epochs to reach convergence.
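For example, with \(N = 1000\) and a batch size of \(32\), each epoch consists of \(\lceil 1000 / 32 \rceil = 32\) batch iterations, the last batch containing only \(8\) samples.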
Below is a Python implementation of the SGD algorithm.
import numpy as np

def SGD(loss, grad_loss, D, theta0, alpha, batch_size, n_epochs):
    X, y = D              # Unpack the data
    N = X.shape[0]        # We assume both X and y have shape (N, )
    d = theta0.shape[0]   # while theta0 has shape (d, )

    idx = np.arange(0, N)  # Indices used for the shuffling

    # Initialization of history vectors
    theta_history = np.zeros((n_epochs, d))     # Save parameters at each epoch
    loss_history = np.zeros((n_epochs, ))       # Save loss values at each epoch
    grad_norm_history = np.zeros((n_epochs, ))  # Save gradient norms at each epoch

    # Initialize weights
    theta = theta0
    for epoch in range(n_epochs):
        # Shuffle the data at the beginning of each epoch
        np.random.shuffle(idx)
        X = X[idx]
        y = y[idx]

        # List collecting the gradient norm at each batch iteration
        grad_loss_vec = []

        for batch_start in range(0, N, batch_size):
            batch_end = min(batch_start + batch_size, N)
            X_batch = X[batch_start:batch_end]
            y_batch = y[batch_start:batch_end]

            # Compute the gradient of the loss on the current batch
            gradient = grad_loss(theta, X_batch, y_batch)
            grad_loss_vec.append(np.linalg.norm(gradient, 2))

            # Update weights
            theta = theta - alpha * gradient

        # Save the updated values (gradient norm averaged over the epoch)
        theta_history[epoch] = theta
        loss_history[epoch] = loss(theta, X, y)
        grad_norm_history[epoch] = np.mean(grad_loss_vec)

    return theta_history, loss_history, grad_norm_history
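As a quick sanity check, here is a minimal usage sketch of the function above, on a linear model \(f_\theta(x) = \theta_1 + \theta_2 x\) trained with the MSE loss (the data and the hyperparameters below are made-up illustrative choices):

# Synthetic 1-dimensional data (illustrative assumption)
rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1, 1, size=(N, ))
y = 2.0 + 3.0 * X + 0.1 * rng.normal(size=(N, ))

def loss(theta, X, y):
    residual = theta[0] + theta[1] * X - y
    return np.mean(residual ** 2)

def grad_loss(theta, X, y):
    residual = theta[0] + theta[1] * X - y
    return np.array([2 * np.mean(residual), 2 * np.mean(residual * X)])

theta_history, loss_history, grad_norm_history = SGD(
    loss, grad_loss, (X, y), theta0=np.zeros(2),
    alpha=0.1, batch_size=32, n_epochs=50,
)
print(theta_history[-1])  # approximately (2.0, 3.0)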
Exercise: Test the SGD algorithm to train a polynomial regression model on the `poly_regression_large.csv` data. Try different values for the polynomial degree.
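A possible setup sketch for this exercise is reported below. The loading step assumes the file contains two comma-separated columns (inputs and targets) with a header row; adjust it to the actual layout of `poly_regression_large.csv`. Note that the SGD function above only shuffles and slices along the first axis, so the 2-D feature matrix can be passed in place of X unchanged.

# Load the data (column layout is an assumption: check the actual file)
data = np.loadtxt("poly_regression_large.csv", delimiter=",", skiprows=1)
x, y = data[:, 0], data[:, 1]

K = 5  # polynomial degree: try different values
Phi = np.vander(x, K + 1, increasing=True)  # features 1, x, ..., x^K

def loss(theta, Phi, y):
    return np.mean((Phi @ theta - y) ** 2)

def grad_loss(theta, Phi, y):
    return (2 / Phi.shape[0]) * Phi.T @ (Phi @ theta - y)

# The step size may need tuning, depending on the scale of the data
theta_history, loss_history, grad_norm_history = SGD(
    loss, grad_loss, (Phi, y), theta0=np.zeros(K + 1),
    alpha=0.05, batch_size=32, n_epochs=100,
)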
Warning
SGD has some drawbacks compared to GD. In particular, there is no way to check whether it has reached convergence (since we obviously cannot compute the gradient of \(\ell(\theta; X, Y)\) to check its distance from zero, as required by the first of the Stopping Criteria), and we cannot use the backtracking algorithm, for the same reason. As a consequence, the algorithm will stop ONLY after reaching the fixed number of epochs, and we must set a good value for the step size \(\alpha_k\) by hand. These problems are partially solved by more recent algorithms such as SGD with Momentum, Adam, AdaGrad, … whose study is beyond the scope of the course.