Gradient Descent#
Consider a general continuous function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, and consider the unconstrained optimization problem of finding a point $x^*$ such that $f(x^*) = \min_{x \in \mathbb{R}^n} f(x)$.
If $f$ is differentiable, the information carried by its gradient $\nabla f(x)$ can be used to search for such a point, which is the idea behind the algorithms discussed in this section.
Convexity#
A common assumption in optimization is convexity. By definition, a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if, for every $x, y \in \mathbb{R}^n$ and every $\lambda \in [0, 1]$,

$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y).$$

The importance of convex functions in optimization is that if $f$ is convex, then every local minimum of $f$ is also a global minimum.
On the other hand, if a function is non-convex (note: the opposite of convex is not concave), then there can be multiple distinct minimum points, some of which are local minima while others are global minima.
Since optimizing convex functions is easier than optimizing non-convex ones, due to the absence of non-global local minima, we will consider a convex objective function for most of the coming discussion.
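As a quick numerical illustration (not part of the original notes; the function $f(x) = \|x\|_2^2$ and the sample size are arbitrary choices for this sketch), one can check the convexity inequality on randomly sampled points:

import numpy as np

# Example of a convex function: f(x) = ||x||_2^2.
f = lambda x: np.sum(x ** 2)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    # Convexity: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y).
    assert f(lam * x + (1 - lam) * y) <= lam * f(x) + (1 - lam) * f(y) + 1e-12
print("Convexity inequality verified on all sampled points.")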
First-order conditions#
Most of the algorithms that look for minimum points of a given function $f$ are based on the information carried by the gradient $\nabla f(x)$. In particular, the following conditions hold.
First-order sufficient condition: if $f$ is convex and differentiable and $\nabla f(x^*) = 0$, then $x^*$ is a global minimum point of $f$.
Moreover, it holds:
First-order necessary condition: if $x^*$ is a local minimum point of a differentiable function $f$, then $\nabla f(x^*) = 0$.
Consequently, we want to find a point $x^*$ such that $\nabla f(x^*) = 0$; such a point is called a stationary point of $f$.
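For example (an illustrative case, not taken from the original text): the function $f(x_1, x_2) = (x_1 - 1)^2 + (x_2 + 2)^2$ is convex and differentiable, its gradient is $\nabla f(x) = (2(x_1 - 1), 2(x_2 + 2))$, and the only point where the gradient vanishes is $x^* = (1, -2)$; by the sufficient condition above, $x^*$ is the global minimum point.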
Gradient descent (GD)#
The most common algorithm to solve optimization problems is the so-called Gradient Descent (GD). It is an iterative algorithm, i.e. an algorithm that iteratively updates the estimate of the solution, defined by the update rule

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k),$$

where the initial iterate $x_0 \in \mathbb{R}^n$ is chosen by the user and $\alpha_k > 0$ is a parameter called the step size.
Note that the gradient descent algorithm is an example of a descent method:

$$x_{k+1} = x_k + \alpha_k p_k,$$

where the descent direction $p_k$ must satisfy $p_k^T \nabla f(x_k) < 0$ and the step size $\alpha_k$ must be chosen appropriately to assure convergence to a stationary point. Since for GD $p_k = -\nabla f(x_k)$, we get $p_k^T \nabla f(x_k) = -\| \nabla f(x_k) \|_2^2 < 0$ whenever $\nabla f(x_k) \neq 0$, so the negative gradient is always a descent direction.
Therefore, the GD algorithm always converges to a stationary point in the limit of infinite iterations, provided the step sizes are chosen so that the conditions above are satisfied.
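As a minimal numerical check (the quadratic below and all variable names are chosen for this sketch, not taken from the text), one can verify that the GD direction has a negative scalar product with the gradient:

import numpy as np

# Convex test function for this sketch: f(x) = ||x||_2^2, so grad_f(x) = 2x.
grad_f = lambda x: 2 * x

x_k = np.array([3.0, -1.5])
p_k = -grad_f(x_k)                # GD descent direction
print(p_k @ grad_f(x_k))          # -45.0 < 0: p_k is a descent direction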
Choice of the initial iterate#
The Gradient Descent (GD) algorithm requires the user to provide an initial iterate $x_0$, and its choice can affect the behaviour of the algorithm.
If $f$ is convex, any choice of $x_0$ leads to the global minimum; if $f$ is non-convex, different choices of $x_0$ can lead to different stationary points, so it is good practice to choose $x_0$ as close as possible to the expected solution, when such information is available.
Step Size#
Selecting the step size is arguably the hardest part of the gradient descent algorithm. Indeed, there are three possible scenarios when selecting the step size $\alpha_k$:

- If $\alpha_k$ is too small, we never get to the minimum: we get closer and closer without ever reaching it. Moreover, we can easily get stuck in local minima when the objective function is non-convex.
- If $\alpha_k$ is too large, we get stuck, bouncing back and forth around the minimum.
- If $\alpha_k$ is correct, we reach the stationary point.

These behaviours can be observed in the short experiment below.
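The experiment uses the one-dimensional quadratic $f(x) = x^2$, whose minimum is at $x^* = 0$; the step-size values are illustrative choices, not taken from the text:

# f(x) = x^2, so grad_f(x) = 2x and the minimum is at x* = 0.
grad_f = lambda x: 2 * x

for alpha in [0.01, 1.0, 0.5]:    # too small, too large, well chosen
    x = 5.0
    for _ in range(50):
        x = x - alpha * grad_f(x)
    print(f"alpha = {alpha}: x after 50 iterations = {x}")

With alpha = 0.01 the iterate is still far from 0, with alpha = 1.0 it keeps bouncing between 5 and -5, while with alpha = 0.5 the minimum is reached.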
Backtracking#
Fortunately, there is a way to guarantee that the chosen step size $\alpha_k$ leads to convergence. It is sufficient that, at each iteration, $\alpha_k$ satisfies the Wolfe conditions:

Sufficient decrease: $f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k p_k^T \nabla f(x_k)$;
Curvature condition: $p_k^T \nabla f(x_k + \alpha_k p_k) \geq c_2 p_k^T \nabla f(x_k)$;

with $0 < c_1 < c_2 < 1$, meaning that the reduction in the function value must be proportional to both the step size $\alpha_k$ and the directional derivative $p_k^T \nabla f(x_k)$.
These conditions are automatically satisfied if $\alpha_k$ is computed by the backtracking procedure below: starting from a relatively large value, $\alpha$ is repeatedly reduced by a factor $\tau \in (0, 1)$ until the sufficient decrease condition holds.
import numpy as np

def backtracking(f, grad_f, x):
    """
    A simple implementation of the backtracking algorithm for the GD
    (Gradient Descent) method.

    f: function. The function that we want to optimize.
    grad_f: function. The gradient of f(x).
    x: ndarray. The current iterate x_k.
    """
    alpha = 1          # initial (large) step size
    c = 0.8            # sufficient decrease parameter
    tau = 0.25         # shrinking factor

    grad = grad_f(x)
    # Shrink alpha until the sufficient decrease condition holds for the
    # GD direction p_k = -grad_f(x_k).
    while f(x - alpha * grad) > f(x) - c * alpha * np.linalg.norm(grad, 2) ** 2:
        alpha = tau * alpha

    return alpha
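For example, the routine just defined can be used to compute the step size of a single GD iteration; the quadratic test function below is an arbitrary choice for this sketch:

# Test function for this sketch: f(x) = ||x||_2^2.
f = lambda x: np.sum(x ** 2)
grad_f = lambda x: 2 * x

x_k = np.array([2.0, -3.0])
alpha_k = backtracking(f, grad_f, x_k)     # step size satisfying sufficient decrease
x_next = x_k - alpha_k * grad_f(x_k)       # one GD step
print(alpha_k, f(x_k), f(x_next))          # alpha_k = 0.0625, f decreases from 13.0 to ~9.95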
Stopping Criteria#
Gradient descent is an iterative algorithm, meaning that it iteratively generates new estimates of the minimum, starting from the initial iterate $x_0$. Since in practice we cannot run the iterations forever, we need a criterion to decide when to stop.
Remember that gradient descent aims to find a stationary point. Consequently, it makes sense to use the norm of the gradient as a stopping criterion: it is common to check whether the norm of the gradient at the current iterate is below a certain tolerance and, if so, stop the iterations. In particular:
Stopping criterion 1: given a tolerance tol_f, stop the iterations as soon as the iterate $x_k$ satisfies $\| \nabla f(x_k) \|_2 <$ tol_f.
Unfortunately, this condition alone is not sufficient. Indeed, if the function is very flat around the minimum, the norm of the gradient can fall below the tolerance even when the iterate is still far from the true minimizer.
Consequently, it is necessary to add another stopping criterion.
Stopping criterion 2: given a tolerance tol_x, stop the iterations as soon as an iterate $x_k$ with $k \geq 1$ satisfies $\| x_k - x_{k-1} \|_2 <$ tol_x.
Moreover, to stop the algorithm from running indefinitely, it is always recommended to add an additional variable, maxit, which automatically stops the algorithm when more than maxit iterations have been performed. When this happens, it is also suggested to inform the user that the algorithm stopped because of the maxit condition and therefore did not converge.
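The following sketch shows one way to combine the two criteria and the maxit safeguard in a simple GD loop; the test function, the tolerances and all variable names are illustrative assumptions, not taken from the text:

import numpy as np

# Test function for this sketch: f(x) = ||x||_2^2, grad_f(x) = 2x.
grad_f = lambda x: 2 * x

x = np.array([5.0, -3.0])                  # initial iterate x_0
alpha, tol_f, tol_x, maxit = 0.1, 1e-6, 1e-6, 1000
grad0_norm = np.linalg.norm(grad_f(x), 2)

for k in range(maxit):
    x_new = x - alpha * grad_f(x)
    # Stopping criterion 1 (gradient norm, here in relative form) or 2 (small update).
    if (np.linalg.norm(grad_f(x_new), 2) < tol_f * grad0_norm
            or np.linalg.norm(x_new - x, 2) < tol_x):
        x = x_new
        break
    x = x_new
else:
    print("Warning: maxit reached, the algorithm did not converge.")
print(k + 1, x)   # number of iterations performed and final iterate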
The choice of the parameters tol_f, tol_x and maxit is free and depends on the specific application. Usually, tol_f and tol_x are chosen in the range [1e-4, 1e-6], while maxit is of the order of 1000.
Exercise: Write a script that implements the GD algorithm, with the following structure (a possible skeleton of the interface is sketched after the output list).

Input:

- f: the function we want to optimize. It is supposed to be a Python function, not an array.
- grad_f: the gradient of f. It is supposed to be a Python function, not an array.
- x0: an n-dimensional array which represents the initial iterate.
- alpha: a float. The fixed step size, given as input.
- maxit: an integer. The maximum possible number of iterations (to avoid infinite loops).
- tolf: a small float. The relative tolerance of the algorithm. Convergence happens if ||grad_f(x_k)||_2 < tolf * ||grad_f(x_0)||_2.
- tolx: a small float. The tolerance in the input domain. Convergence happens if ||x_k - x_{k-1}||_2 < tolx. Pay attention to the first iterate.

Output:

- x: an array that contains the value of x_k FOR EACH iterate x_k (not only the last one).
- k: an integer. The number of iterations needed to converge (k < maxit).
- f_val: an array that contains the value of f(x_k) FOR EACH iterate x_k.
- grad_norm: an array that contains the value of ||grad_f(x_k)||_2 FOR EACH iterate x_k.
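A possible skeleton matching this interface is sketched below; only the signature, the output containers and the docstring are given, since filling in the update and the stopping tests is the point of the exercise:

import numpy as np

def GD(f, grad_f, x0, alpha, maxit, tolf, tolx):
    """
    Sketch of the requested interface (the body is left as the exercise).

    Returns:
    x : ndarray. All the iterates x_k, one per row.
    k : int. The number of iterations performed.
    f_val : ndarray. f(x_k) for each iterate.
    grad_norm : ndarray. ||grad_f(x_k)||_2 for each iterate.
    """
    x = [x0]
    f_val = [f(x0)]
    grad_norm = [np.linalg.norm(grad_f(x0), 2)]
    k = 0
    # TODO: iterate x_{k+1} = x_k - alpha * grad_f(x_k), append the new values,
    # and stop according to tolf, tolx or maxit (see the criteria above).
    return np.array(x), k, np.array(f_val), np.array(grad_norm)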
Then test the function above to optimize the objective function:
with different values of the step size alpha. Moreover, compute the minimum of the objective function and compare it with the result returned by the algorithm.
Exercise: Write a script that implements the GD algorithm with backtracking, with the same structure as the GD algorithm defined above, except for the input alpha, which is not required. Compare the results obtained by this algorithm with those obtained in the previous exercise.
Exercise: Test the GD algorithm with and without backtracking on the Rosenbrock function, defined as:

$$f(x, y) = (a - x)^2 + b (y - x^2)^2,$$
for different values of the parameters $a$ and $b$ (a classical choice being $a = 1$ and $b = 100$).
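For reference, the Rosenbrock function and its gradient can be coded as follows; the function names are arbitrary choices for this sketch:

import numpy as np

def rosenbrock(x, a=1.0, b=100.0):
    # f(x, y) = (a - x)^2 + b * (y - x^2)^2, with global minimum at (a, a^2).
    return (a - x[0]) ** 2 + b * (x[1] - x[0] ** 2) ** 2

def grad_rosenbrock(x, a=1.0, b=100.0):
    dx = -2 * (a - x[0]) - 4 * b * x[0] * (x[1] - x[0] ** 2)
    dy = 2 * b * (x[1] - x[0] ** 2)
    return np.array([dx, dy])

print(rosenbrock(np.array([1.0, 1.0])))       # 0.0 at the minimum (for a = 1)
print(grad_rosenbrock(np.array([1.0, 1.0])))  # [0. 0.]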