Unconstrained Optimization

Resources

Main Idea

Solve

$$\min_x f(x).$$

Iterative methods are used most commonly. The general form of the update at iteration $k$ involves a step size or learning rate $\alpha_k > 0$ and a direction $p_k$:

$$x_{k+1} = x_k + \alpha_k p_k.$$

Using $p_k = -\nabla f(x_k)$ gives Gradient Descent, which can be related to Ordinary Differential Equations through the Gradient Flow Equation.
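As a concrete illustration, here is a minimal gradient descent sketch following the update above with $p_k = -\nabla f(x_k)$ and a fixed step size; the function names and the quadratic example are made up for illustration, not taken from any particular library.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, n_steps=100):
    """Minimal sketch of x_{k+1} = x_k + alpha_k * p_k with p_k = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        p = -grad_f(x)      # descent direction: negative gradient
        x = x + alpha * p   # fixed step size alpha_k = alpha
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimizer is the origin.
x_star = gradient_descent(lambda x: 2 * x, x0=[1.0, -2.0])
```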

Second-order methods (using $\nabla^2 f$ information) are commonly categorized as either line-search methods or trust-region methods, depending on whether the direction $p_k$ or the step length is fixed first.

Line-search Methods

After determining $p_k$, $\alpha_k$ is determined from the one-dimensional problem,

$$\arg\min_{\alpha \ge 0} f(x_k + \alpha p_k).$$

Usually, this is solved approximately, by an inexact line search.

Wolfe Conditions

The Armijo rule is

$$\underbrace{f(x_k + \alpha_k p_k)}_{f(x_{k+1})} \le f(x_k) + c_1 \alpha_k \underbrace{p_k^T \nabla f(x_k)}_{<\,0 \text{ for a descent method}}.$$

In short, for a descent method, this condition requires that $f$ is reduced "enough", based on $\nabla f(x_k)$ and $p_k$. For instance, if both are large in magnitude and $p_k$ points strongly against the gradient, then we should be able to decrease $f$ more substantially than if they are small or nearly orthogonal to one another. Based on this, it is also referred to as the sufficient decrease rule.

The curvature rule is

$$\underbrace{p_k^T \nabla f(x_k + \alpha_k p_k)}_{\text{directional derivative at the next step}} \ge c_2\, p_k^T \nabla f(x_k).$$

We can rearrange to reveal a term resembling $\nabla f(x_k + \alpha_k p_k) - \nabla f(x_k)$, which is part of the numerator of a forward difference approximating the Hessian along $p_k$. The direction of the inequality corresponds to positive curvature along $p_k$. To further understand this rule, suppose $\alpha_k = 0$; then

$$p_k^T \nabla f(x_k) \ge c_2\, p_k^T \nabla f(x_k).$$

If $p_k$ is a descent direction, then $p_k^T \nabla f(x_k) < 0$, so this would require

$$1 \le c_2.$$

Yet we often take $c_2 = 0.9 < 1$, so this condition would be violated; the curvature rule therefore rejects $\alpha_k = 0$ and, more generally, steps that are too small to make real progress.

The strong Wolfe condition on curvature extends this idea, requiring

$$|p_k^T \nabla f(x_k + \alpha_k p_k)| \le c_2\, |p_k^T \nabla f(x_k)|.$$

Suppose $\nabla f(x_k) = -p_k$ and $\nabla f(x_k + \alpha_k p_k) = p_k$; in other words, $\alpha_k$ would be so aggressive that the slope along $p_k$ becomes the opposite of what it was. The (weak) condition on curvature would give:

$$p_k^T p_k \ge -c_2\, p_k^T p_k, \qquad \|p_k\|_2^2 \ge -c_2 \|p_k\|_2^2, \qquad 1 \ge -c_2,$$

which is satisfied for the usual $c_2 = 0.9$. However, the strong version would require $|-1| = 1 \le c_2$, thus rejecting this step. To summarize, the strong Wolfe condition prevents not only steps that are too small but also those that are too aggressive.
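To make the inequalities concrete, here is a small sketch that checks both Wolfe conditions for a candidate step; the helper name `wolfe_conditions` and the quadratic example are assumptions for illustration, and a practical line search would combine such checks with backtracking or bracketing.

```python
import numpy as np

def wolfe_conditions(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, strong_curvature) for the candidate step alpha along p."""
    g0 = grad_f(x)
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * (p @ g0)   # Armijo rule
    strong_curvature = abs(p @ grad_f(x_new)) <= c2 * abs(p @ g0)    # strong Wolfe curvature rule
    return sufficient_decrease, strong_curvature

# Example on f(x) = ||x||^2 with a steepest-descent direction.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, -2.0])
p = -grad_f(x)
print(wolfe_conditions(f, grad_f, x, p, alpha=0.5))  # alpha = 0.5 lands on the minimizer along p
```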

Quasi-Newton Methods

Newton's Method computes the update as

$$x_{k+1} = x_k - \big[\underbrace{\nabla^2 f(x_k)}_{B_k}\big]^{-1} \nabla f(x_k).$$

Near the solution, this has quadratic convergence, but far from it, the Newton step may not even be a descent direction. Moreover, computing the Hessian is expensive, especially as the dimension of $x$ grows.
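For a quadratic objective, a single Newton step lands exactly on the minimizer; the small sketch below (with an assumed quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$, so the gradient is $Ax - b$ and the Hessian is $A$) illustrates this.

```python
import numpy as np

# One Newton step on the assumed quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of f
b = np.array([1.0, 1.0])
x = np.zeros(2)
grad = A @ x - b
x_new = x - np.linalg.solve(A, grad)     # x - [Hessian]^{-1} gradient; equals A^{-1} b, the exact minimizer
```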

Quasi-Newton methods approximate the Hessian with an iteratively updated Bk. This must satisfy the secant equation:

$$B_{k+1}\underbrace{(x_{k+1} - x_k)}_{s_k} = \underbrace{\nabla f(x_{k+1}) - \nabla f(x_k)}_{y_k}.$$

Note that $B_k$ depends on $x_k$ (not $x_{k+1}$), and $B_{k+1}$ is used to find $x_{k+2}$. We can view the secant equation as a backward Finite Difference approximation of the Hessian along the direction $x_{k+1} - x_k$. In 1D this gives a single formula,

$$B_{k+1} = \frac{f'(x_{k+1}) - f'(x_k)}{x_{k+1} - x_k} \approx f''(x_{k+1}).$$
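A quick numeric check of this 1D view, using an assumed example $f(x) = x^4$ whose second derivative is available in closed form:

```python
# Secant quotient vs. the true second derivative for f(x) = x**4.
fprime = lambda x: 4 * x**3     # f'(x)
fsecond = lambda x: 12 * x**2   # f''(x), for comparison
x_k, x_k1 = 1.0, 1.1
B_k1 = (fprime(x_k1) - fprime(x_k)) / (x_k1 - x_k)
print(B_k1, fsecond(x_k1))      # ~13.24 vs 14.52; the gap shrinks as x_{k+1} - x_k -> 0
```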

There are multiple ways of constructing $B_{k+1}$ that satisfy the secant equation; the particular choice differentiates Symmetric Rank 1 (SR1) from Broyden-Fletcher-Goldfarb-Shanno (BFGS) methods.

BFGS methods take $H_k = B_k^{-1}$, and

$$p_k = -H_k \nabla f(x_k),$$

where the update to the approximated Hessian is specified as

$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k^T}{s_k^T B_k s_k}.$$

Instead of computing $B_{k+1}^{-1}$ from scratch, we can update the inverse from the previous step: the BFGS update is rank 2, so the Sherman-Morrison formula can be applied twice (or the Sherman-Morrison-Woodbury identity used once) to obtain $H_{k+1}$ directly.
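Below is a compact, illustrative BFGS sketch that maintains the inverse approximation $H_k$ directly via the standard product-form update (what applying Sherman-Morrison-Woodbury to the rank-2 update yields). The fixed step size stands in for a proper Wolfe line search, and the function and variable names are assumptions for the example.

```python
import numpy as np

def bfgs(grad_f, x0, n_steps=50, alpha=0.1):
    """BFGS sketch: maintain H_k ~ [Hessian]^{-1} and set p_k = -H_k grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                      # initial inverse-Hessian approximation
    g = grad_f(x)
    for _ in range(n_steps):
        p = -H @ g                     # quasi-Newton direction
        s = alpha * p                  # s_k = x_{k+1} - x_k (fixed alpha; use a Wolfe line search in practice)
        x_new = x + s
        g_new = grad_f(x_new)
        y = g_new - g                  # y_k = grad f(x_{k+1}) - grad f(x_k)
        ys = y @ s
        if ys <= 1e-12:                # skip the update if the curvature condition y^T s > 0 fails
            break
        rho = 1.0 / ys
        I = np.eye(n)
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: the convex quadratic f(x) = 0.5 x^T A x with A = diag(1, 10), gradient A x.
A = np.diag([1.0, 10.0])
x_min = bfgs(lambda x: A @ x, x0=[3.0, -1.0])
```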

TODO

Go through the PyTorch implementation of LBFGS. Pay particular attention to how the line-search is done and the other heuristics that are used.

AdaGrad

AdaGrad adds a preconditioner based on the gradient history, scaling down updates to frequently updated parameters relative to infrequently updated ones.
Denoting $g_k = \nabla f(x_k)$ as the gradient at step $k$,

$$G = \sum_{i=1}^{k} g_i g_i^T.$$

The update is then given as

$$x_{k+1} = x_k - \eta\, \operatorname{diag}(G)^{-1/2} g_k.$$

In other words, the preconditioner for parameter $j$ divides by the 2-norm of the history of that parameter's gradient components (a vector with $k$ entries, one per step, not $n$).
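Here is a minimal sketch of diagonal AdaGrad, keeping only the running diagonal of $G$; the small `eps` added for numerical stability is standard practice but not part of the formula above, and the objective in the example is made up.

```python
import numpy as np

def adagrad(grad_f, x0, eta=0.5, n_steps=200, eps=1e-8):
    """Diagonal AdaGrad: scale each coordinate by the root of its accumulated squared gradients."""
    x = np.asarray(x0, dtype=float)
    G_diag = np.zeros_like(x)                        # running diag(G) = sum_i g_i * g_i (elementwise)
    for _ in range(n_steps):
        g = grad_f(x)
        G_diag += g * g
        x = x - eta * g / (np.sqrt(G_diag) + eps)    # per-parameter effective learning rate
    return x

# Example: f(x) = x_1^2 + 100 x_2^2; the coordinate with larger gradients gets smaller effective steps.
x_min = adagrad(lambda x: np.array([2.0 * x[0], 200.0 * x[1]]), x0=[1.0, 1.0])
```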

TODO

Trust-Region Methods