Unconstrained Optimization

Resources

Main Idea

Solve

$$\min_x f(x).$$

Iterative methods are used most commonly. The general form of the update at iteration $k$ involves a step size or learning rate $\alpha_k > 0$ and a direction $p_k$:

$$x_{k+1} = x_k + \alpha_k p_k.$$

Using $p_k = -\nabla f(x_k)$ gives Gradient Descent, which can be related to Ordinary Differential Equations through the Gradient Flow Equation.
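As a concrete illustration, here is a minimal gradient descent sketch following the update above with $p_k = -\nabla f(x_k)$ and a fixed step size; the function names and the quadratic example are made up for illustration, not taken from any particular library.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, n_steps=100):
    """Minimal sketch of x_{k+1} = x_k + alpha_k * p_k with p_k = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        p = -grad_f(x)      # descent direction: negative gradient
        x = x + alpha * p   # fixed step size alpha_k = alpha
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimizer is the origin.
x_star = gradient_descent(lambda x: 2 * x, x0=[1.0, -2.0])
```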

Second-order methods (using $\nabla^2 f$ information) are commonly categorized as either line-search methods or trust-region methods, depending on whether the direction $p_k$ or the step length is fixed first.

Line-search Methods

After determining $p_k$, $\alpha_k$ is determined from the one-dimensional problem,

$$\arg\min_{\alpha \ge 0} f(x_k + \alpha p_k).$$

Usually, this is solved approximately, by an inexact line search.

Wolfe Conditions

The Armijo rule is

$$\underbrace{f(x_k + \alpha_k p_k)}_{f(x_{k+1})} \le f(x_k) + c_1 \alpha_k \underbrace{p_k^T \nabla f(x_k)}_{<\,0 \text{ for a descent method}}.$$

In short, for a descent method, this condition requires that $f$ is reduced "enough", based on $\nabla f(x_k)$ and $p_k$. For instance, if both are large in magnitude and $p_k$ points strongly against the gradient, then we should be able to decrease $f$ more substantially than if they are small or nearly orthogonal to one another. Based on this, it is also referred to as the sufficient decrease rule.

The curvature rule is

$$\underbrace{p_k^T \nabla f(x_k + \alpha_k p_k)}_{\text{directional derivative at the next step}} \ge c_2\, p_k^T \nabla f(x_k).$$

We can rearrange to reveal a term resembling $\nabla f(x_k + \alpha_k p_k) - \nabla f(x_k)$, which is part of the numerator of a forward difference approximating the Hessian along $p_k$. The direction of the inequality corresponds to positive curvature along $p_k$. To further understand this rule, suppose $\alpha_k = 0$; then

$$p_k^T \nabla f(x_k) \ge c_2\, p_k^T \nabla f(x_k).$$

If $p_k$ is a descent direction, then $p_k^T \nabla f(x_k) < 0$, so this would require

$$1 \le c_2.$$

Yet we often take $c_2 = 0.9 < 1$, so this condition would be violated; the curvature rule therefore rejects $\alpha_k = 0$ and, more generally, steps that are too small to make real progress.

The strong Wolfe condition on curvature extends this idea, requiring

$$|p_k^T \nabla f(x_k + \alpha_k p_k)| \le c_2\, |p_k^T \nabla f(x_k)|.$$

Suppose $\nabla f(x_k) = -p_k$ and $\nabla f(x_k + \alpha_k p_k) = p_k$; in other words, $\alpha_k$ would be so aggressive that the slope along $p_k$ becomes the opposite of what it was. The (weak) condition on curvature would give:

$$p_k^T p_k \ge -c_2\, p_k^T p_k, \qquad \|p_k\|_2^2 \ge -c_2 \|p_k\|_2^2, \qquad 1 \ge -c_2,$$

which is satisfied for the usual $c_2 = 0.9$. However, the strong version would require $|-1| = 1 \le c_2$, thus rejecting this step. To summarize, the strong Wolfe condition prevents not only steps that are too small but also those that are too aggressive.
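To make the inequalities concrete, here is a small sketch that checks both Wolfe conditions for a candidate step; the helper name `wolfe_conditions` and the quadratic example are assumptions for illustration, and a practical line search would combine such checks with backtracking or bracketing.

```python
import numpy as np

def wolfe_conditions(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, strong_curvature) for the candidate step alpha along p."""
    g0 = grad_f(x)
    x_new = x + alpha * p
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * (p @ g0)   # Armijo rule
    strong_curvature = abs(p @ grad_f(x_new)) <= c2 * abs(p @ g0)    # strong Wolfe curvature rule
    return sufficient_decrease, strong_curvature

# Example on f(x) = ||x||^2 with a steepest-descent direction.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x = np.array([1.0, -2.0])
p = -grad_f(x)
print(wolfe_conditions(f, grad_f, x, p, alpha=0.5))  # alpha = 0.5 lands on the minimizer along p
```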

Quasi-Newton Methods

Newton's Method computes the update as

$$x_{k+1} = x_k - \big[\underbrace{\nabla^2 f(x_k)}_{B_k}\big]^{-1} \nabla f(x_k).$$

Near the solution, this has quadratic convergence, but far from it, the Newton step may not even be a descent direction. Moreover, computing the Hessian is expensive, especially as the dimension of $x$ grows.
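For a quadratic objective, a single Newton step lands exactly on the minimizer; the small sketch below (with an assumed quadratic $f(x) = \tfrac{1}{2} x^T A x - b^T x$, so the gradient is $Ax - b$ and the Hessian is $A$) illustrates this.

```python
import numpy as np

# One Newton step on the assumed quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian of f
b = np.array([1.0, 1.0])
x = np.zeros(2)
grad = A @ x - b
x_new = x - np.linalg.solve(A, grad)     # x - [Hessian]^{-1} gradient; equals A^{-1} b, the exact minimizer
```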

Quasi-Newton methods approximate the Hessian with an iteratively updated Bk. This must satisfy the secant equation:

$$B_{k+1}\underbrace{(x_{k+1} - x_k)}_{s_k} = \underbrace{\nabla f(x_{k+1}) - \nabla f(x_k)}_{y_k}.$$

Note that $B_k$ depends on $x_k$ (not $x_{k+1}$), and $B_{k+1}$ is used to find $x_{k+2}$. We can view the secant equation as a backward Finite Difference approximation of the Hessian along the direction $x_{k+1} - x_k$. In 1D this gives a single formula,

$$B_{k+1} = \frac{f'(x_{k+1}) - f'(x_k)}{x_{k+1} - x_k} \approx f''(x_{k+1}).$$
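A quick numeric check of this 1D view, using an assumed example $f(x) = x^4$ whose second derivative is available in closed form:

```python
# Secant quotient vs. the true second derivative for f(x) = x**4.
fprime = lambda x: 4 * x**3     # f'(x)
fsecond = lambda x: 12 * x**2   # f''(x), for comparison
x_k, x_k1 = 1.0, 1.1
B_k1 = (fprime(x_k1) - fprime(x_k)) / (x_k1 - x_k)
print(B_k1, fsecond(x_k1))      # ~13.24 vs 14.52; the gap shrinks as x_{k+1} - x_k -> 0
```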

There are multiple ways of constructing $B_{k+1}$ that satisfy the secant equation; the particular choice differentiates Symmetric Rank 1 (SR1) from Broyden-Fletcher-Goldfarb-Shanno (BFGS) methods.

BFGS methods take $H_k = B_k^{-1}$, and

$$p_k = -H_k \nabla f(x_k),$$

where the update to the approximated Hessian is specified as

$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k^T}{s_k^T B_k s_k}.$$

Instead of computing $B_{k+1}^{-1}$ from scratch, we can update the inverse from the previous step: the BFGS update is rank 2, so the Sherman-Morrison formula can be applied twice (or the Sherman-Morrison-Woodbury identity used once) to obtain $H_{k+1}$ directly.
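Below is a compact, illustrative BFGS sketch that maintains the inverse approximation $H_k$ directly via the standard product-form update (what applying Sherman-Morrison-Woodbury to the rank-2 update yields). The fixed step size stands in for a proper Wolfe line search, and the function and variable names are assumptions for the example.

```python
import numpy as np

def bfgs(grad_f, x0, n_steps=50, alpha=0.1):
    """BFGS sketch: maintain H_k ~ [Hessian]^{-1} and set p_k = -H_k grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)                      # initial inverse-Hessian approximation
    g = grad_f(x)
    for _ in range(n_steps):
        p = -H @ g                     # quasi-Newton direction
        s = alpha * p                  # s_k = x_{k+1} - x_k (fixed alpha; use a Wolfe line search in practice)
        x_new = x + s
        g_new = grad_f(x_new)
        y = g_new - g                  # y_k = grad f(x_{k+1}) - grad f(x_k)
        ys = y @ s
        if ys <= 1e-12:                # skip the update if the curvature condition y^T s > 0 fails
            break
        rho = 1.0 / ys
        I = np.eye(n)
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: the convex quadratic f(x) = 0.5 x^T A x with A = diag(1, 10), gradient A x.
A = np.diag([1.0, 10.0])
x_min = bfgs(lambda x: A @ x, x0=[3.0, -1.0])
```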

TODO

Go through the PyTorch implementation of LBFGS. Pay particular attention to how the line-search is done and the other heuristics that are used.

AdaGrad

AdaGrad adds a preconditioner based on the gradient history, scaling down updates to frequently updated parameters relative to infrequently updated ones.
Denoting $g_k = \nabla f(x_k)$ as the gradient at step $k$,

$$G = \sum_{i=1}^{k} g_i g_i^T.$$

The update is then given as

$$x_{k+1} = x_k - \eta\, \operatorname{diag}(G)^{-1/2} g_k.$$

In other words, the preconditioner for parameter $j$ divides by the 2-norm of the history of that parameter's gradient components (a vector with $k$ entries, one per step, not $n$).
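Here is a minimal sketch of diagonal AdaGrad, keeping only the running diagonal of $G$; the small `eps` added for numerical stability is standard practice but not part of the formula above, and the objective in the example is made up.

```python
import numpy as np

def adagrad(grad_f, x0, eta=0.5, n_steps=200, eps=1e-8):
    """Diagonal AdaGrad: scale each coordinate by the root of its accumulated squared gradients."""
    x = np.asarray(x0, dtype=float)
    G_diag = np.zeros_like(x)                        # running diag(G) = sum_i g_i * g_i (elementwise)
    for _ in range(n_steps):
        g = grad_f(x)
        G_diag += g * g
        x = x - eta * g / (np.sqrt(G_diag) + eps)    # per-parameter effective learning rate
    return x

# Example: f(x) = x_1^2 + 100 x_2^2; the coordinate with larger gradients gets smaller effective steps.
x_min = adagrad(lambda x: np.array([2.0 * x[0], 200.0 * x[1]]), x0=[1.0, 1.0])
```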

TODO

Trust-Region Methods