Higher-Order Methods for Deep Learning: Curvature, Rates, and Practicality

1. Why higher-order?

First-order methods are cheap but can be curvature-blind. Higher-order methods use Hessian or Fisher structure to improve conditioning.

2. Newton and damped Newton

Newton step:

Δθ_t = -H_t^{-1}g_t

where g_t = ∇L(θ_t) and H_t = ∇²L(θ_t).

The damped variant replaces H_t^{-1} with (H_t + λI)^{-1} for some λ > 0, which keeps the step well-defined when H_t is indefinite or near-singular.
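A minimal NumPy sketch of the damped step (function and variable names are illustrative, not from any particular library). On a quadratic loss L(θ) = ½θᵀAθ, the undamped step recovers the minimizer exactly in one iteration:

```python
import numpy as np

def damped_newton_step(grad, hess, lam=1e-3):
    """One damped Newton step: solve (H + lam*I) d = -g for the direction d."""
    n = grad.shape[0]
    return np.linalg.solve(hess + lam * np.eye(n), -grad)

# Quadratic loss L(θ) = ½ θᵀAθ with ill-conditioned curvature:
A = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])
step = damped_newton_step(A @ theta, A, lam=0.0)  # undamped: exact Newton
theta_next = theta + step  # lands on the minimizer θ = 0 in one step
```

For a non-quadratic loss, λ interpolates between a Newton step (λ → 0) and a scaled gradient step (λ large).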

Local quadratic convergence theorem

Assume L twice continuously differentiable, Hessian Lipschitz near minimizer θ⋆, and H(θ⋆) ≻ 0. Then pure Newton iterates sufficiently close to θ⋆ satisfy

||θ_{t+1} - θ⋆|| ≤ C||θ_t - θ⋆||²

Proof sketch: Taylor-expand the gradient around θ_t and use the stability of the Hessian inverse in a neighborhood of the optimum; the first-order error term cancels, leaving a second-order residual with constant C proportional to the Hessian-Lipschitz constant divided by the smallest eigenvalue of H(θ⋆).
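The quadratic rate is easy to see numerically. A short sketch on the scalar loss L(θ) = exp(θ) - θ, whose minimizer is θ⋆ = 0 with L''(0) = 1 > 0:

```python
import math

# Pure Newton on L(θ) = exp(θ) - θ: gradient e^θ - 1, Hessian e^θ > 0.
theta = 1.0
errors = [abs(theta)]  # distance to the minimizer θ* = 0
for _ in range(3):
    theta -= (math.exp(theta) - 1.0) / math.exp(theta)  # θ - g/H
    errors.append(abs(theta))
# each error is on the order of the square of the previous one
```

Starting from θ = 1, the errors drop roughly as 1 → 0.37 → 0.06 → 0.002, consistent with the quadratic bound above.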

3. Cubic regularization global rate

At iteration t, solve approximately

min_s m_t(s) = g_t^T s + (1/2)s^T H_t s + (ρ/6)||s||³

This method has worst-case complexity to reach ||∇L|| ≤ ε of

O(ε^{-3/2})

improving over gradient descent's O(ε^{-2}) in nonconvex smooth settings.
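The subproblem has no closed form in general, but its stationarity condition (H_t + (ρ/2)‖s‖ I)s = -g_t reduces it to a one-dimensional search over r = ‖s‖. A sketch under the assumption that H_t is positive definite (the helper name is illustrative):

```python
import numpy as np

def cubic_reg_step(g, H, rho, iters=50):
    """Approximately solve min_s gᵀs + ½sᵀHs + (ρ/6)‖s‖³.
    Stationarity gives (H + (ρ r / 2) I) s = -g with r = ‖s‖;
    we find r by bisection on ‖s(r)‖ - r, which decreases in r."""
    n = g.shape[0]
    def s_of(r):
        return np.linalg.solve(H + 0.5 * rho * r * np.eye(n), -g)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(s_of(hi)) > hi:  # grow bracket until ‖s(hi)‖ ≤ hi
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(s_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return s_of(hi)

g = np.array([1.0, 1.0])
H = np.diag([1.0, 10.0])
s = cubic_reg_step(g, H, rho=1.0)  # shorter than the Newton step -H⁻¹g
```

In the indefinite case, practical solvers instead use Lanczos/Krylov methods on the same subproblem; the bisection above is only a small-scale illustration.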

4. Natural gradient and information geometry

Define Fisher matrix

F(θ) = E_{x~D, y~p_θ(·|x)}[∇log p_θ(y|x)∇log p_θ(y|x)^T]

Natural gradient step:

Δθ_t = -ηF(θ_t)^{-1}g_t

It is steepest descent in distribution space under the local KL (Fisher-Rao) metric, which makes the update invariant to smooth reparameterizations of θ to first order.

Proposition (trust-region equivalence, first order)

Solve

min_Δ g^T Δ s.t. (1/2)Δ^T FΔ ≤ ε

Stationarity of the Lagrangian gives g + λFΔ = 0, so Δ ∝ -F^{-1}g; enforcing the constraint with equality fixes the scale:

Δ⋆ = -√(2ε/(g^T F^{-1}g)) F^{-1}g

So natural gradient is the trust-region direction under KL-induced quadratic approximation.
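The closed form is easy to verify numerically: the step is antiparallel to F^{-1}g and exactly saturates the KL trust region. A sketch with a random SPD matrix standing in for F(θ_t):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
F = A @ A.T + 3 * np.eye(3)      # stand-in for the Fisher matrix (SPD)
g = rng.standard_normal(3)       # stand-in for the gradient
eps = 0.01                        # trust-region radius

Fg = np.linalg.solve(F, g)       # F⁻¹g, the natural-gradient direction
delta = -np.sqrt(2 * eps / (g @ Fg)) * Fg  # closed-form Δ⋆ from above

radius = 0.5 * delta @ F @ delta  # (1/2)ΔᵀFΔ, equals eps at the optimum
```

Substituting Δ⋆ back shows (1/2)Δ⋆ᵀFΔ⋆ = ε, so the constraint is active, as the active-constraint derivation assumes.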

5. Practical approximations: K-FAC and Shampoo

Exact Hessian/Fisher inversion is infeasible at scale. Structured approximations:

  • K-FAC: Kronecker-factorized blocks per layer.
  • Shampoo: tensor preconditioning with matrix roots.
  • L-BFGS variants: low-memory curvature estimates.

Cost-benefit depends on batch size, model architecture, and hardware communication overhead.
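To make the structured-approximation idea concrete, here is a minimal Shampoo-style sketch for a matrix-shaped gradient G: accumulate the two factor statistics L = ΣGGᵀ and R = ΣGᵀG, then precondition with inverse fourth roots. This is an illustration of the preconditioning rule, not the full algorithm (no learning rate schedule, grafting, or blocking):

```python
import numpy as np

def inv_root(M, p, eps=1e-6):
    """M^{-1/p} for a symmetric PSD matrix M, via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag((w + eps) ** (-1.0 / p)) @ V.T

def shampoo_precondition(G, L, R):
    """One Shampoo-style update for a matrix gradient G:
    accumulate Kronecker-factor statistics, apply inverse 4th roots."""
    L = L + G @ G.T
    R = R + G.T @ G
    G_pre = inv_root(L, 4) @ G @ inv_root(R, 4)
    return G_pre, L, R

G = np.array([[1.0, 2.0], [3.0, 4.0]])
L0, R0 = np.zeros((2, 2)), np.zeros((2, 2))
G_pre, L1, R1 = shampoo_precondition(G, L0, R0)
```

On the first step the preconditioner whitens the gradient: all singular values of G_pre are approximately 1, so the update treats all directions equally regardless of the raw gradient's anisotropy.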

6. When higher-order wins

Higher-order methods provide strongest gains when:

  1. curvature anisotropy is high,
  2. batch sizes are large enough for stable curvature estimates,
  3. wall-clock bottleneck is optimization steps rather than per-step compute.

Practical considerations:

  • Newton-type methods excel near minimizers with positive-definite Hessian, where curvature correction fixes the ill-conditioning that slows gradient descent
  • Natural gradients provide parameterization invariance for probabilistic models
  • Modern approximations like K-FAC make higher-order methods feasible at scale
  • Success depends on matching method characteristics to problem structure and hardware constraints