Gradient Descent#

Consider a general continuous function $f: \mathbb{R}^n \to \mathbb{R}$. Given a set of admissible solutions $\Omega \subseteq \mathbb{R}^n$, an optimization problem is a problem of the form

$$\min_{x \in \Omega} f(x)$$

If $\Omega = \mathbb{R}^n$, we say that the optimization is unconstrained. If $\Omega \subsetneq \mathbb{R}^n$, the problem is constrained. In the following, we will always assume $\Omega = \mathbb{R}^n$, i.e. the unconstrained setup.

Convexity#

A common assumption in optimization is convexity. By definition, a function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if

$$f(t x_1 + (1-t) x_2) \leq t f(x_1) + (1-t) f(x_2), \quad \forall x_1, x_2 \in \mathbb{R}^n, \; t \in [0, 1]$$

The importance of convex functions in optimization is that if $f(x)$ is convex, then every minimum point $x^*$ of $f(x)$ is a global minimum, and the set of global minima is connected (the $n$-dimensional extension of the concept of interval).

On the other side, if a function is non-convex (note: the opposite of convex is not concave), then there can be multiple distinct minimum points, some of which are local minima while others are global minima.

Since optimizing convex functions is easier than optimizing non-convex functions, due to the absence of non-global local minima, we will consider convex objective functions for most of the coming discussion.
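The convexity inequality above can also be probed numerically. The sketch below (the helper name `is_convex_sample` is made up for illustration) samples random pairs of points and checks the definition; a single violation proves non-convexity, while the absence of violations only suggests convexity, it does not prove it:

```python
import numpy as np

def is_convex_sample(f, n_trials=1000, scale=10.0, seed=0):
    """Numerically probe the convexity inequality
    f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2)
    at randomly sampled points in [-scale, scale]."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x1, x2 = rng.uniform(-scale, scale, 2)
        t = rng.uniform(0, 1)
        # Small tolerance guards against floating-point noise.
        if f(t * x1 + (1 - t) * x2) > t * f(x1) + (1 - t) * f(x2) + 1e-12:
            return False
    return True

print(is_convex_sample(lambda x: x ** 2))      # x^2 is convex
print(is_convex_sample(lambda x: np.cos(x)))   # cos(x) is not convex
```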

First-order conditions#

Most of the algorithms to find the minimum points of a given function $f(x)$ are based on the following property:

First-order necessary condition: If $f: \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function and $x^* \in \mathbb{R}^n$ is a (local) minimum point of $f(x)$, then $\nabla f(x^*) = 0$.

However, the converse does not hold: $\nabla f(x^*) = 0$ alone is not sufficient for $x^*$ to be a minimum.

If $f: \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function and $\nabla f(x^*) = 0$ for some $x^* \in \mathbb{R}^n$, then $x^*$ is either a (local) minimum, a (local) maximum, or a saddle point of $f(x)$.

Consequently, we want to find a point $x^* \in \mathbb{R}^n$ such that $\nabla f(x^*) = 0$. Those points are called stationary points of $f(x)$.

Gradient descent (GD)#

The most common algorithm to solve optimization problems is the so-called Gradient Descent (GD). It is an iterative algorithm, i.e. an algorithm that iteratively updates the estimate of the solution, defined as:

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad k = 0, 1, \dots, \text{maxit}$$

where the initial iterate $x_0 \in \mathbb{R}^n$ is given as input, and the step size (equivalently, learning rate) $\alpha_k > 0$ controls how fast $f(x)$ decreases along the iterations, for any $k \in \mathbb{N}$.

Note

The gradient descent algorithm is an example of a descent method:

$$x_{k+1} = x_k + \alpha_k p_k, \quad k = 0, 1, \dots, \text{maxit},$$

where the descent direction $p_k$ has to be chosen such that, for any $k \in \mathbb{N}$:

$$\nabla f(x_k)^T p_k \leq 0,$$

to ensure convergence to a stationary point. Since for GD $p_k = -\nabla f(x_k)$, the condition above is always satisfied:

$$\nabla f(x_k)^T p_k = -\nabla f(x_k)^T \nabla f(x_k) = -\|\nabla f(x_k)\|_2^2 \leq 0.$$

Therefore, with a suitable choice of step size, the GD algorithm always converges to a stationary point in the limit of infinite iterations.
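The update rule above can be sketched in a few lines of Python. This is a minimal fixed-step version (the function name and test problem are made up for illustration), applied to $f(x) = \|x\|^2 / 2$, whose gradient is simply $x$ and whose unique minimum is the origin:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, maxit=1000):
    """Plain gradient descent with a fixed step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        # The GD update: move against the gradient direction.
        x = x - alpha * grad_f(x)
    return x

# f(x) = ||x||^2 / 2 has gradient x and minimum at the origin.
x_min = gradient_descent(lambda x: x, x0=[3.0, -2.0], alpha=0.1)
print(x_min)  # very close to [0, 0]
```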

Choice of the initial iterate#

The Gradient Descent (GD) algorithm requires the user to input an initial iterate $x_0 \in \mathbb{R}^n$. Theoretically, since GD has a global convergence property, for any $x_0$ it will always converge to a stationary point of $f(x)$, i.e. to a point $x^*$ such that $\nabla f(x^*) = 0$.

If $f(x)$ is convex, then every stationary point is a (global) minimum of $f(x)$, implying that the choice of $x_0$ is not really important, and we can always set $x_0 = 0$. On the other side, when $f(x)$ is not convex, we have to choose $x_0$ as close as possible to the desired stationary point, to increase the chances of reaching it. If an estimate of the correct minimum point is not available, we will just set $x_0 = 0$ and accept whatever local minimum the algorithm reaches.

Step Size#

Selecting the step size is arguably the hardest part of the gradient descent algorithm. Indeed, there are three possible scenarios when selecting the step size:

  • $\alpha_k$ is too small: we never get to the minimum, getting closer and closer without reaching it. Moreover, we can easily get stuck in local minima when the objective function is non-convex.

  • $\alpha_k$ is too large: we get stuck, bouncing back and forth around the minimum.

  • $\alpha_k$ is correct: we reach the stationary point.
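These scenarios are easy to reproduce on the one-dimensional function $f(x) = x^2$, whose gradient is $2x$, so each GD step multiplies the iterate by $(1 - 2\alpha)$. The helper below (`gd_scalar` is a made-up name for this sketch) contrasts a small, convergent step size with one that is too large and makes the iterates bounce with growing amplitude:

```python
def gd_scalar(alpha, x0=1.0, steps=50):
    """Run fixed-step GD on f(x) = x^2 (gradient 2x); return the final iterate."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # each step multiplies x by (1 - 2*alpha)
    return x

print(abs(gd_scalar(0.1)))   # tiny: the iterates shrink toward 0
print(abs(gd_scalar(1.1)))   # huge: the iterates alternate sign and diverge
```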

Backtracking#

Fortunately, there is a way to guarantee that the chosen step size $\alpha_k$ is correct. Indeed, it is possible to prove that, if $\alpha_k$ satisfies the following conditions (known as Wolfe conditions):

  • Sufficient decrease: $f(x_k - \alpha_k \nabla f(x_k)) \leq f(x_k) - c_1 \alpha_k \|\nabla f(x_k)\|_2^2$;

  • Curvature condition: $\nabla f(x_k)^T \nabla f(x_k - \alpha_k \nabla f(x_k)) \leq c_2 \|\nabla f(x_k)\|_2^2$;

with $0 < c_1 < c_2 < 1$, for any $k \in \mathbb{N}$, then the associated GD algorithm will always converge to a stationary point. In particular, the sufficient decrease condition, also known as the Armijo condition, is of particular importance. It requires that $x_{k+1}$ causes a sufficiently large reduction in the value of $f$ compared to $x_k$. Indeed, note that the Armijo condition can be rewritten as:

$$f(x_k) - f(\underbrace{x_k - \alpha_k \nabla f(x_k)}_{x_{k+1}}) \geq c_1 \alpha_k \|\nabla f(x_k)\|_2^2,$$

meaning that the reduction in the value of $f$ due to $x_{k+1}$ has to be at least as big as a multiple of $\|\nabla f(x_k)\|_2^2$.

These conditions are automatically satisfied if $\alpha_k$ is chosen by the backtracking algorithm. The idea of this algorithm is to begin with an initial guess for $\alpha_k$, which is then repeatedly reduced as $\alpha_k \leftarrow \tau \alpha_k$, with $\tau < 1$, until the Armijo condition is satisfied. It can be proved that when $\alpha_k$ is computed this way, the second Wolfe condition is automatically satisfied. A Python implementation of the backtracking algorithm follows.

import numpy as np

def backtracking(f, grad_f, x):
    """
    Simple implementation of the backtracking algorithm for
    the GD (Gradient Descent) method.

    f: function. The function that we want to optimize.
    grad_f: function. The gradient of f(x).
    x: ndarray. The current iterate x_k.

    Returns a step size alpha satisfying the Armijo condition.
    """
    alpha = 1    # initial guess for the step size
    c = 0.8      # Armijo constant c_1
    tau = 0.25   # reduction factor tau < 1

    # Shrink alpha until the Armijo (sufficient decrease) condition holds.
    while f(x - alpha * grad_f(x)) > f(x) - c * alpha * np.linalg.norm(grad_f(x), 2) ** 2:
        alpha = tau * alpha
    return alpha
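As a quick sanity check, the self-contained sketch below reproduces the same backtracking loop on the simple test function $f(x) = \|x\|^2$ (the function, starting point, and constants are chosen arbitrarily for illustration) and verifies that the resulting step size actually decreases $f$:

```python
import numpy as np

f = lambda x: np.sum(x ** 2)   # simple convex test function
grad_f = lambda x: 2 * x       # its gradient
x = np.array([1.0, 2.0])

# Backtracking: shrink alpha until the Armijo condition holds.
alpha, c, tau = 1.0, 0.8, 0.25
while f(x - alpha * grad_f(x)) > f(x) - c * alpha * np.linalg.norm(grad_f(x)) ** 2:
    alpha *= tau

# The returned step size yields a sufficient decrease of f.
print(alpha, f(x - alpha * grad_f(x)) < f(x))
```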

Stopping Criteria#

Gradient descent is an iterative algorithm, meaning that it iteratively generates new estimates of the minimum, starting from $x_0$. Theoretically, after infinite iterations, we converge to the solution of the optimization problem but, since we cannot run infinite iterations, we have to find a way to tell the algorithm when it is time to stop. A convergence condition for an iterative algorithm is called a stopping criterion.

Remember that gradient descent aims to find stationary points. Consequently, it makes sense to use the norm of the gradient as a stopping criterion. In particular, it is common to check whether the norm of the gradient at the current iterate is below a certain tolerance and, if so, stop the iterations. In particular:

Stopping criterion 1: Given a tolerance tol_f, for any iterate $x_k$, check whether $\|\nabla f(x_k)\| \leq \text{tol\_f} \cdot \|\nabla f(x_0)\|$. If so, stop the iterations.

Unfortunately, this condition alone is not sufficient. Indeed, if the function $f(x)$ is almost flat around its minimum, then $\|\nabla f(x_k)\|$ will be small even if $x_k$ is far from the true minimum.

Consequently, it is required to add another stopping criterion.

Stopping criterion 2: Given a tolerance tol_x, for any iterate $x_k$, check whether $\|x_k - x_{k-1}\| \leq \text{tol\_x}$. If so, stop the iterations.

Moreover, to stop the algorithm from running indefinitely, it is always recommended to add an additional variable, maxit, which automatically stops the algorithm if more than maxit iterations have been performed. When this happens, it is suggested to also tell the user that the algorithm stopped due to the maxit condition and therefore did not converge.

The choice of the parameters tol_f, tol_x, and maxit is free and depends on the specific application. Usually, tol_f and tol_x are chosen in the range [1e-4, 1e-6], while maxit is around 1000.
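Both stopping criteria, together with the maxit safeguard, can be combined in a single loop. The sketch below (a minimal illustration, not a solution to the exercise that follows: it uses a fixed step size and simplified outputs) shows where each check fits:

```python
import numpy as np

def gd_with_stopping(grad_f, x0, alpha=0.1, tolf=1e-6, tolx=1e-6, maxit=1000):
    """Fixed-step GD with the two stopping criteria plus a maxit safeguard.
    Returns (last iterate, iterations performed, converged flag)."""
    x = np.asarray(x0, dtype=float)
    grad0_norm = np.linalg.norm(grad_f(x))
    for k in range(maxit):
        x_new = x - alpha * grad_f(x)
        # Criterion 1: relative gradient norm below tolf.
        if np.linalg.norm(grad_f(x_new)) <= tolf * grad0_norm:
            return x_new, k + 1, True
        # Criterion 2: the iterates stopped moving.
        if np.linalg.norm(x_new - x) <= tolx:
            return x_new, k + 1, True
        x = x_new
    # maxit reached: report that the algorithm did not converge.
    return x, maxit, False

x, k, converged = gd_with_stopping(lambda x: 2 * x, x0=[5.0, -3.0])
print(k, converged)  # converges well before maxit on this convex quadratic
```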

Exercise: Write a script that implements the GD algorithm, with the following structure:

  • Input:

    • f: the function f(x) we want to optimize. It is supposed to be a Python function, not an array.

    • grad_f: the gradient of f(x). It is supposed to be a Python function, not an array.

    • x0: an n-dimensional array which represents the initial iterate.

    • alpha: a float. The fixed step size as input.

    • maxit: an integer. The maximum possible number of iterations (to avoid infinite loops).

    • tolf: small float. The relative tolerance of the algorithm. Convergence happens if ||grad_f(x_k)||_2 < tolf * ||grad_f(x_0)||_2.

    • tolx: small float. The tolerance in the input domain. Convergence happens if ||x_k - x_{k-1}||_2 < tolx. Pay attention to the first iterate.

  • Output:

    • x: an array that contains the value of x_k FOR EACH iterate x_k (not only the last one).

    • k: an integer. The number of iterations needed to converge. k < maxit.

    • f_val: an array that contains the value of f(x_k) FOR EACH iterate x_k.

    • grad_norm: an array that contains the value of ||grad_f(x_k)||_2 FOR EACH iterate x_k.

Then test the function above to optimize the objective function:

$$f(x) = 10(x_1 - 1)^2 + (x_2 + 1)^2,$$

with different values of $\alpha$. Plot the value of the objective function and the norm of the gradient to compare the behavior for different values of $\alpha$.
Moreover, compute the minimum of $f(x)$ by hand. Compute the relative error between the solution obtained with the GD algorithm and the one computed by hand, for each choice of $\alpha$.

Exercise: Write a script that implements the GD algorithm with backtracking, with the same structure as the GD algorithm defined above, except for the input alpha, which is not required. Compare the results obtained by this algorithm with those obtained in the previous exercise.

Exercise: Test the GD algorithm with and without backtracking on the Rosenbrock function, defined as:

$$f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2,$$

for different values of $x_0$.