Kiyani, Elham, Khemraj Shukla, Jorge F. Urbán, Jérôme Darbon, and George Em Karniadakis. 2025. “Which Optimizer Works Best for Physics-Informed Neural Networks and Kolmogorov-Arnold Networks?” arXiv. https://doi.org/10.48550/arXiv.2501.16371.
Broadly, second-order methods (those using curvature/Hessian information) are categorized as either line-search methods or trust-region methods, according to whether the search direction $p_k$ or the allowed step size is restricted first.
Line-search Methods
After determining the search direction $p_k$, the step length $\alpha_k$ is determined based on the 1-dimensional problem,

$$\min_{\alpha > 0} f(x_k + \alpha p_k).$$
Usually, this is solved approximately, by an inexact line search.
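As a concrete special case (a standard textbook result, not specific to the paper): for a quadratic objective $f(x) = \tfrac{1}{2} x^T A x - b^T x$ with $A$ symmetric positive definite, the 1-dimensional problem can be solved exactly by setting $\tfrac{d}{d\alpha} f(x_k + \alpha p_k) = \nabla f(x_k + \alpha p_k)^T p_k = 0$, which gives

$$\alpha_k = -\frac{\nabla f(x_k)^T p_k}{p_k^T A p_k}.$$

For general nonlinear $f$, no such closed form exists, hence the inexact approach.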
Wolfe Conditions
The Armijo rule is

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \, \nabla f(x_k)^T p_k, \qquad 0 < c_1 < 1.$$
In short, for a descent method, this condition requires that $f$ is reduced "enough", based on $\nabla f(x_k)$ and $p_k$. For instance, if both of those are large in magnitude and strongly aligned (so that $|\nabla f(x_k)^T p_k|$ is large), then we should be able to decrease $f$ more substantially than if they are small, or nearly orthogonal to one another. Based on this, it is also referred to as the sufficient decrease rule.
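As a concrete illustration, here is a minimal backtracking line-search sketch in Python/NumPy (the function and parameter values, e.g. $c_1 = 10^{-4}$ and a halving factor, are illustrative choices, not from the paper), which shrinks $\alpha$ until the Armijo condition holds:

```python
import numpy as np

def backtracking_armijo(f, grad_f, x, p, alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Shrink alpha until f(x + alpha p) <= f(x) + c1 * alpha * grad_f(x)^T p."""
    fx = f(x)
    slope = grad_f(x) @ p          # directional derivative; should be < 0 for descent
    alpha = alpha0
    for _ in range(max_iter):
        if f(x + alpha * p) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= shrink            # step too aggressive: backtrack
    return alpha                   # fallback if no acceptable step was found

# Usage on a simple quadratic f(x) = 0.5 ||x||^2 with the steepest-descent direction
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([3.0, -4.0])
print(backtracking_armijo(f, grad_f, x, p=-grad_f(x)))
```

Because backtracking starts from a full step $\alpha_0$ and only ever shrinks it, it implicitly avoids overly small steps, which is why backtracking implementations often enforce only the Armijo condition.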
The curvature rule is

$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \, \nabla f(x_k)^T p_k, \qquad c_1 < c_2 < 1.$$
We can rearrange to reveal a term resembling $\left[\nabla f(x_k + \alpha_k p_k) - \nabla f(x_k)\right]^T p_k$, which is the numerator of a forward difference approximating the Hessian (the curvature) along $p_k$; the direction of the inequality corresponds to requiring sufficiently positive curvature. To further understand this rule, suppose $\alpha_k \to 0$; then

$$\nabla f(x_k + \alpha_k p_k)^T p_k \to \nabla f(x_k)^T p_k.$$

If $p_k$ is a descent direction, then $\nabla f(x_k)^T p_k < 0$, so this would require

$$\nabla f(x_k)^T p_k \ge c_2 \, \nabla f(x_k)^T p_k \iff c_2 \ge 1,$$

yet we typically take $c_2 \in (c_1, 1)$ (e.g. $c_2 = 0.9$), so this condition would be violated. The curvature rule therefore rejects steps that are too small.
The strong Wolfe condition on curvature extends this idea, requiring

$$\left| \nabla f(x_k + \alpha_k p_k)^T p_k \right| \le c_2 \left| \nabla f(x_k)^T p_k \right|.$$
Suppose $\nabla f(x_k)^T p_k = -1$ and $\nabla f(x_k + \alpha_k p_k)^T p_k = +1$; in other words, $\alpha_k$ would be so aggressive that the slope along $p_k$ flips to the opposite sign. The (weak) condition on curvature would read

$$+1 \ge -c_2,$$

which is satisfied for the usual $c_2 \in (0, 1)$. However, the strong version would require $|{+1}| \le c_2 |{-1}|$, i.e. $c_2 \ge 1$, thus rejecting this step. Thus, to summarize, the strong Wolfe condition prevents not only steps that are too small but also those that are too aggressive.
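A small sketch in Python/NumPy (the helper name and the quadratic test function are illustrative, not from the paper) that checks the Armijo, weak-curvature, and strong-curvature conditions for a trial step, including an overly aggressive step that the weak condition accepts but the strong condition rejects:

```python
import numpy as np

def wolfe_report(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (armijo, weak_curvature, strong_curvature) for a trial step alpha."""
    g0 = grad_f(x) @ p                      # slope at alpha = 0 (negative for descent)
    g1 = grad_f(x + alpha * p) @ p          # slope at the trial point
    armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0
    weak = g1 >= c2 * g0                    # curvature rule
    strong = abs(g1) <= c2 * abs(g0)        # strong Wolfe curvature rule
    return bool(armijo), bool(weak), bool(strong)

# 1D quadratic f(x) = 0.5 x^2: from x = 1 with p = -1, the step alpha = 1.95
# overshoots the minimizer and flips the slope from -1 to +0.95.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x, p = np.array([1.0]), np.array([-1.0])
print(wolfe_report(f, grad_f, x, p, alpha=1.95))  # (True, True, False)
```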
Newton's method takes the direction $p_k = -\left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$. Near the solution, this has quadratic convergence, but further away, it may not even be a descent method (the Hessian need not be positive definite there). Further, computing and factorizing the Hessian is expensive, especially as the dimension of $x$ grows.
Quasi-Newton methods approximate the Hessian with an iteratively updated $B_k$. Writing $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$, this approximation must satisfy the secant equation:

$$B_{k+1} s_k = y_k.$$
Note that $B_{k+1}$ depends on $x_k$ and $x_{k+1}$ (not $x_{k+2}$), and is used to find $x_{k+2}$. We can view the secant equation as a backward finite-difference approximation of the Hessian along the direction of $s_k$. In 1D this gives a single formula,

$$B_{k+1} = \frac{f'(x_{k+1}) - f'(x_k)}{x_{k+1} - x_k}.$$
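A quick numerical check of the 1D formula (a throwaway sketch; $f(x) = \sin x$ is just an arbitrary test function):

```python
import numpy as np

# 1D secant approximation: B ≈ (f'(x1) - f'(x0)) / (x1 - x0)
f_prime = np.cos                      # derivative of f(x) = sin(x)
f_second = lambda x: -np.sin(x)       # true second derivative, for comparison

x0, x1 = 1.0, 1.1
B = (f_prime(x1) - f_prime(x0)) / (x1 - x0)
print(B)                              # ≈ -0.867
print(f_second(0.5 * (x0 + x1)))      # ≈ -0.867, f'' near the interval midpoint
```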
In more than one dimension, the secant equation does not determine $B_{k+1}$ uniquely, so there are multiple ways of constructing it; the choice of construction is what differentiates Symmetric Rank 1 (SR1) from Broyden-Fletcher-Goldfarb-Shanno (BFGS) methods.
BFGS methods take $p_k = -B_k^{-1} \nabla f(x_k)$ and $x_{k+1} = x_k + \alpha_k p_k$, where the update to the approximated Hessian is specified as

$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}.$$
Instead of computing $B_{k+1}^{-1}$ from scratch at every iteration, we can use the Sherman-Morrison formula to update the inverse from the previous step: the BFGS update is the sum of two rank-1 corrections (a rank-2 update), so the formula is applied twice (equivalently, Sherman-Morrison-Woodbury).
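A sketch of this in Python/NumPy. The closed-form inverse update below, $H_{k+1} = (I - \rho_k s_k y_k^T) H_k (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T$ with $\rho_k = 1/(y_k^T s_k)$, is the standard result of carrying out that Sherman-Morrison calculation; the check verifies it against a direct inverse of the rank-2 update on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# A symmetric positive-definite "current" approximation B and its inverse H.
M = rng.standard_normal((n, n))
B = M @ M.T + 4 * np.eye(n)
H = np.linalg.inv(B)

# A curvature pair (s, y) with y^T s > 0, as guaranteed by the Wolfe conditions.
s = rng.standard_normal(n)
y = B @ s + 0.1 * rng.standard_normal(n)
assert y @ s > 0

# Direct rank-2 BFGS update of B ...
B_new = B + np.outer(y, y) / (y @ s) - (B @ np.outer(s, s) @ B) / (s @ B @ s)

# ... versus the closed-form update of the inverse H.
rho = 1.0 / (y @ s)
V = np.eye(n) - rho * np.outer(s, y)
H_new = V @ H @ V.T + rho * np.outer(s, s)

print(np.allclose(H_new, np.linalg.inv(B_new)))  # True: same matrix either way
```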
TODO
Go through the PyTorch implementation of LBFGS. Pay particular attention to how the line-search is done and the other heuristics that are used.
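In the meantime, a minimal usage sketch (assuming PyTorch is available; the least-squares toy problem is arbitrary). torch.optim.LBFGS expects a closure that re-evaluates the loss, since the line search may evaluate the objective several times per step, and line_search_fn="strong_wolfe" selects its strong-Wolfe line search instead of a fixed step:

```python
import torch

# Toy objective: minimize ||A w - b||^2 over a single parameter vector w.
A = torch.randn(20, 5)
b = torch.randn(20)
w = torch.zeros(5, requires_grad=True)

optimizer = torch.optim.LBFGS(
    [w],
    lr=1.0,
    max_iter=50,
    history_size=10,
    line_search_fn="strong_wolfe",  # without this, a fixed step based on lr is used
)

def closure():
    # The loss/gradient computation lives in a closure because L-BFGS may
    # call it multiple times per optimizer step.
    optimizer.zero_grad()
    loss = torch.sum((A @ w - b) ** 2)
    loss.backward()
    return loss

optimizer.step(closure)
print(closure().item())
```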
AdaGrad
AdaGrad adds a preconditioner based on the gradient history, weighting updates to frequently (or strongly) updated parameters less than updates to infrequently updated parameters.
Denoting $g_t$ as the gradient at step $t$, AdaGrad accumulates the element-wise squared gradients,

$$G_t = G_{t-1} + g_t \odot g_t, \qquad G_0 = 0.$$

The update is then given as

$$x_{t+1} = x_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t,$$

with the square root and division taken element-wise.
In other words, the preconditioner for parameter $i$ divides its update by the $\ell_2$ norm of that parameter's gradient history, $\sqrt{\sum_{\tau \le t} g_{\tau, i}^2}$; the accumulator is a vector with $n$ entries, not an $n \times n$ matrix.
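A minimal sketch of this update in Python/NumPy (the step size $\eta = 0.1$, the $\epsilon$ value, and the badly scaled quadratic are illustrative choices):

```python
import numpy as np

def adagrad_step(x, g, G, eta=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients per parameter and
    scale each coordinate's update by 1 / sqrt(accumulated history)."""
    G = G + g * g                        # element-wise; one entry per parameter
    x = x - eta * g / (np.sqrt(G) + eps)
    return x, G

# Usage on f(x) = 0.5 * x^T diag(100, 1) x: the badly scaled coordinate
# accumulates large squared gradients, so its effective step size shrinks faster.
grad = lambda x: np.array([100.0, 1.0]) * x
x = np.array([1.0, 1.0])
G = np.zeros_like(x)
for _ in range(200):
    x, G = adagrad_step(x, grad(x), G)
print(x)   # both coordinates move toward 0, despite the 100x scale difference
```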