Multi-Objective Optimization

Resources

Main Idea

Within Optimization, there are multiple objective functions, and the goal is formulated as

$$\min_{x \in X} \underbrace{\begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_k(x) \end{bmatrix}}_{f(x)},$$

with $k \geq 2$, and $X$ is the feasible set, typically $X \subseteq \mathbb{R}^n$ (which can incorporate constraints), and $f : X \to \mathbb{R}^k$. Denote $Y = \{f(x) : x \in X\} \subseteq \mathbb{R}^k$.

Details

A feasible solution $x_1 \in X$ is said to Pareto dominate another solution $x_2 \in X$ if

$$\forall i \in [k],\ f_i(x_1) \leq f_i(x_2), \quad \text{and} \quad \exists i \in [k],\ f_i(x_1) < f_i(x_2).$$

That is, $x_1$ dominates $x_2$ if it is at least as good as $x_2$ on every objective function, and strictly better on at least one. A solution $x^* \in X$ is Pareto optimal (or Pareto efficient) if no other solution dominates it:

$$\neg(\exists x \in X : x \text{ dominates } x^*) \iff \forall x \in X : x \text{ does not dominate } x^*.$$

The Pareto front is $X^* = \{x \in X : x \text{ is Pareto optimal}\}$. This is bounded by the nadir objective vector, $z^{\text{nadir}}$, which is a collection of the worst (greatest) objective values achieved by $x \in X^*$. Conversely, the ideal objective vector $z^{\text{ideal}}$ is a collection of the best (least) objective values. Thus, for any $x \in X^*$, $z^{\text{ideal}} \leq f(x) \leq z^{\text{nadir}}$ in a component-wise sense (the lower bound holds even for all $x \in X$; the nadir only bounds the Pareto front).
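
These definitions are easy to check numerically on a finite set of candidate solutions. A minimal sketch, assuming NumPy; the function names and objective values are made up for illustration:

```python
import numpy as np

def dominates(f1, f2):
    """True if objective vector f1 Pareto dominates f2 (minimization)."""
    f1, f2 = np.asarray(f1), np.asarray(f2)
    return bool(np.all(f1 <= f2) and np.any(f1 < f2))

def pareto_front(F):
    """Indices of Pareto-optimal rows of F (each row = f(x) for one x)."""
    F = np.asarray(F)
    return [i for i in range(len(F))
            if not any(dominates(F[j], F[i]) for j in range(len(F)) if j != i)]

# Toy bi-objective values for five candidate solutions (illustrative only).
F = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0], [3.0, 3.0], [2.0, 5.0]])
front = pareto_front(F)          # rows [3.0, 3.0] and [2.0, 5.0] are dominated
z_ideal = F[front].min(axis=0)   # componentwise best over the front
z_nadir = F[front].max(axis=0)   # componentwise worst over the front
```

Note the quadratic cost in the number of candidates; this is only meant to make the dominance relation concrete.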

Scalarization

The process of scalarization converts f to a scalar-valued function, which is instead optimized. The global criterion scalarization problem considers

$$\min_{x \in X} \|f(x) - z^{\text{ideal}}\|,$$

where the norm can be any Lebesgue Space norm, e.g. $L_1$, $L_2$, $L_\infty$. Note that this depends on the scale of the objective functions, so some uniform, dimensionless scale is recommended.
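
A small sketch of the global criterion with rescaling, using the ideal and nadir points to make the objectives dimensionless. The function name `global_criterion` and all values are illustrative:

```python
import numpy as np

def global_criterion(fx, z_ideal, z_nadir, ord=2):
    """Distance from f(x) to the ideal point, after rescaling each objective
    to [0, 1] via the ideal/nadir range so the norm is dimensionless."""
    fx, z_ideal, z_nadir = map(np.asarray, (fx, z_ideal, z_nadir))
    scaled = (fx - z_ideal) / (z_nadir - z_ideal)
    return np.linalg.norm(scaled, ord=ord)

# With z_ideal = (1, 1), z_nadir = (4, 4), the point f(x) = (2, 2)
# rescales to (1/3, 1/3) before taking the norm.
value = global_criterion([2.0, 2.0], [1.0, 1.0], [4.0, 4.0])
```

Passing `ord=1` or `ord=np.inf` switches the $L_1$ / $L_\infty$ variants via `np.linalg.norm`.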

More generally, a scalarization uses a function $g : \mathbb{R}^k \times \mathbb{R}^{|\theta|} \to \mathbb{R}$ (potentially with parameters $\theta$) to convert to a single-objective problem:

$$\min g(f_1(x), \ldots, f_k(x), \theta), \quad \text{s.t. } x \in X_\theta.$$

Here, we also have $X_\theta \subseteq X$.

Linear Scalarization

$$\min_{x \in X} \sum_{i=1}^k \theta_i f_i(x), \quad \theta_i > 0$$
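
As a toy illustration over a finite candidate set (the values are made up), each strictly positive weight vector selects one Pareto-optimal candidate; sweeping the weights traces out different trade-offs:

```python
import numpy as np

# Each row of F holds (f1(x), f2(x)) for one candidate x (illustrative values).
F = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0]])
theta = np.array([0.5, 0.5])       # strictly positive weights
scores = F @ theta                 # theta_1 * f1 + theta_2 * f2 per candidate
best = int(np.argmin(scores))      # the balanced candidate (2, 2) wins here
```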

ϵ-Constraint Method

$$\min f_j(x), \quad \text{s.t. } x \in X,\ f_i(x) \leq \varepsilon_i,\ \forall i \in [k] \setminus \{j\}.$$

Here, the parameters $\theta$ correspond to the $\varepsilon_i$. We have applied this approach in Partial Differential Equation Discovery. Given $X = \mathbb{R}^n$, this requires solving a Constrained Optimization problem, as opposed to an Unconstrained Optimization problem for the linear scalarization.
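
A minimal sketch of the $\epsilon$-constraint method, assuming SciPy is available and using a toy bi-objective problem of my choosing (not the PDE-discovery setting):

```python
from scipy.optimize import minimize

# Toy bi-objective problem (values chosen for illustration):
# f1(x) = x^2, f2(x) = (x - 2)^2, with a budget eps on f2.
f1 = lambda x: x[0] ** 2
f2 = lambda x: (x[0] - 2.0) ** 2
eps = 1.0

# Minimize f1 subject to f2(x) <= eps.  SLSQP's "ineq" constraints
# require g(x) >= 0, so we pass g(x) = eps - f2(x).
res = minimize(f1, x0=[1.5], method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: eps - f2(x)}])
# The constrained optimum sits on the boundary f2(x) = eps, at x = 1.
```

Varying `eps` traces out different Pareto-optimal points, which is the usual way this scalarization is used.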

Other methods include hypervolume scalarization and Chebyshev scalarization.

Jacobian Descent

Recalling $f : \mathbb{R}^n \to \mathbb{R}^k$, the Jacobian of $f$ is $Jf : \mathbb{R}^n \to \mathbb{R}^{k \times n}$ (the transpose of what stacking gradients side by side would look like). This is a matrix for a given input $x$, with entries $[Jf]_{ij} = \frac{\partial f_i}{\partial x_j}$, or

$$Jf = \begin{bmatrix} \nabla f_1^T \\ \nabla f_2^T \\ \vdots \\ \nabla f_k^T \end{bmatrix}$$

A Taylor expansion about $x$ with perturbation $\Delta x$ gives

$$f(x + \Delta x) = f(x) + [Jf(x)] \Delta x + O(\|\Delta x\|^2).$$

We define an aggregator function $A : \mathbb{R}^{k \times n} \to \mathbb{R}^n$. In other words, $A$ converts the Jacobian to a vector of the same size as the gradient, allowing for an update rule similar to that of Gradient Descent, with $\eta > 0$:

$$x_{j+1} = x_j - \eta A[Jf(x_j)].$$

For example, when $k = 1$, we take $A$ as the transpose operation and recover gradient descent exactly.
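
As a sketch of the update rule, with made-up quadratic objectives and a simple mean aggregator (a weighted aggregator with $w = \frac{1}{k}\mathbf{1}$, so it matches the linear scalarization rather than a conflict-aware method):

```python
import numpy as np

def jacobian_descent(jac, x0, aggregator, lr=0.1, steps=200):
    """Iterate x_{j+1} = x_j - lr * A(Jf(x_j)); aggregator maps (k, n) -> (n,)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * aggregator(jac(x))
    return x

# Made-up objectives on R^2: f1 = ||x - a||^2, f2 = ||x - b||^2.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
jac = lambda x: np.stack([2.0 * (x - a), 2.0 * (x - b)])  # rows = gradients
mean_agg = lambda J: J.mean(axis=0)  # A(J) = J^T w with w = (1/k, ..., 1/k)
x_star = jacobian_descent(jac, [5.0, 5.0], mean_agg)
# x_star approaches (0.5, 0.5), the minimizer of (f1 + f2) / 2
```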

We want to decrease $f$ across all of its entries. Thus, according to the Taylor expansion, we require

$$f(x + \Delta x) - f(x) \approx (Jf(x)) \underbrace{(-\eta A[Jf(x)])}_{\Delta x} \preceq 0,$$

where $\preceq$ is the relation for the (natural) Partial Order on $\mathbb{R}^k$ ($a \preceq 0 \iff a_i \leq 0,\ \forall i$). Take $u$ as row $i$ of $Jf(x) \in \mathbb{R}^{k \times n}$ (so $u \in \mathbb{R}^n$), and $v$ as $A[Jf(x)]$ (which is just a single column). If

$$u^T v = u^T A[Jf(x)] < 0,$$

then the objective function $f_i$ will increase with this step $\Delta x = -\eta A[Jf(x)]$ (according to the first-order Taylor Expansion). This does not always mean that $x$ Pareto dominates $x + \Delta x$: if another objective function decreases, neither point dominates the other.

Definition

Let $A : \mathbb{R}^{k \times n} \to \mathbb{R}^n$ be an aggregator. For any $J \in \mathbb{R}^{k \times n}$, the aggregator and $J$ are nonconflicting if

$$J A(J) \succeq 0.$$

Preconditioned Gradient Descent (Normalized Steepest Descent)

Let $k = 1$. Consider an $n \times n$ matrix $P$. Normalized steepest descent in the space described by $\|\cdot\|_P$ gives the update as

$$\Delta x = -P^{-1} \nabla f(x).$$

With the same restriction that the step decreases the objective function, we require that

$$\nabla f(x)^T \underbrace{(-P^{-1} \nabla f(x))}_{\Delta x} \leq 0.$$

$P$ cannot depend on $\nabla f(x)$, so this must hold for any $\nabla f(x)$. Thus, we require that $P^{-1}$ is positive semi-definite. The corresponding $A(J) = P^{-1} J^T$ is nonconflicting for all $J \in \mathbb{R}^{1 \times n}$, as $J P^{-1} J^T \geq 0$.

In other words, the nonconflicting requirement on J and A is that of positive semi-definiteness.
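
A quick numeric sanity check (with an arbitrary PSD $P^{-1}$ built as $BB^T$, all values illustrative) that $A(J) = P^{-1} J^T$ is nonconflicting when $k = 1$:

```python
import numpy as np

# Illustrative PSD preconditioner: P^{-1} = B B^T is PSD by construction.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
P_inv = B @ B.T

A = lambda J: P_inv @ J.T          # aggregator A(J) = P^{-1} J^T for k = 1
J = rng.normal(size=(1, 3))        # a 1 x n Jacobian (a gradient row)
value = (J @ A(J)).item()          # J A(J) = ||B^T J^T||^2 >= 0
```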

Definition

Let $A : \mathbb{R}^{k \times n} \to \mathbb{R}^n$ be an aggregator. If for all $J \in \mathbb{R}^{k \times n}$ and $c \in \mathbb{R}^k$, $c \geq 0$, the mapping

$$c \mapsto A(\operatorname{diag}(c) J)$$

is linear, then $A$ is linear under scaling.

Here, $c$ is some weighting of the rows of $J$, meaning that if a given row of $J$ changes by a constant factor, its contribution to $\Delta x$ (through $A$) should change similarly.

A weighted aggregator can be written as

$$A(J) = J^T w,$$

for some $w \in \mathbb{R}^k$, for all $J$. This corresponds to the linear scalarization approach.

The Projection of $x \in \mathbb{R}^n$ onto the dual cone of the rows of $J$ is

$$\pi_J(x) = \operatorname*{arg\,min}_{y :\ Jy \geq 0} \|y - x\|_2$$

We can think of the $Jy \geq 0$ constraint as a convex cone in $\mathbb{R}^n$. Adding more rows to $J$ can only make this cone narrower. If $A[Jf(x)]$ lies within this cone for $J = Jf(x)$, then according to the first-order Taylor Expansion, taking the step $\Delta x = -\eta A[Jf(x)]$ should reduce $f$ across all of its entries. The projection gives the nearest point for which this holds, or the current point itself, if it already reduces $f$ across all entries. It's possible for the dual cone to be just $\{0\}$, in which case the projection is always $0$. Intuitively, the dual cone is the set of vectors within 90 degrees of every row of $J$.
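
A straightforward (though not particularly efficient) way to compute $\pi_J$ is to treat it as a small quadratic program. The sketch below assumes NumPy and SciPy; `project_dual_cone` is an illustrative name, not a standard API:

```python
import numpy as np
from scipy.optimize import minimize

def project_dual_cone(J, x):
    """Project x onto {y : Jy >= 0}, the dual cone of the rows of J.
    Solves the QP  min_y ||y - x||^2  s.t.  Jy >= 0  with SLSQP."""
    x = np.asarray(x, dtype=float)
    res = minimize(lambda y: np.sum((y - x) ** 2), x0=x,
                   jac=lambda y: 2.0 * (y - x), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda y: J @ y}])
    return res.x

# With a single row J = [1, 0], the dual cone is the half-space y_1 >= 0,
# so projecting (-2, 3) clips the first coordinate to 0.
J = np.array([[1.0, 0.0]])
y = project_dual_cone(J, np.array([-2.0, 3.0]))
```

For many rows or repeated projections, a dedicated QP solver (active-set or dual methods) would be the more efficient route.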

Question

How can $\pi_J$ be implemented efficiently?