The entry asks "why the square root?" On seeing it, I immediately noticed that with log-likelihood as the loss function, the whitening metric looks a lot like the Jeffreys prior (https://en.wikipedia.org/wiki/Jeffreys_prior), or an approximation of it, which is a reference prior when the CLT holds. The square root can be derived from the reference-prior structure, but in a lot of modeling scenarios it also has the effect of scaling things proportionally to the scale of the parameters (for lack of a better way of putting it: think standard error versus sampling variance).
If you think of the optimization method this way, you're essentially reconstructing a kind of Bayesian criterion with a Jeffreys prior.
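To spell out the analogy (my own gloss, not something from the entry): with a log-likelihood loss, the expected outer product of per-sample gradients is the Fisher information, the square root of its determinant is exactly Jeffreys' prior, and the whitening metric is the matrix-square-root analogue built from sampled gradients.

```latex
% Fisher information = expected outer product of the score (the log-likelihood gradient)
\[ I(\theta) = \mathbb{E}\!\left[ \nabla_\theta \log p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)^{\top} \right] \]

% Jeffreys prior: this is where the square root lives
\[ p(\theta) \propto \sqrt{\det I(\theta)} \]

% Empirical whitening metric: a sample analogue of the same square root
\[ \Big( \tfrac{1}{n} \sum_{i=1}^{n} g_i\, g_i^{\top} \Big)^{1/2} \approx I(\theta)^{1/2} \]
```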
>Likely, there is a method that can use the orthogonalization machinery of Muon while keeping the signal-to-noise estimation of Adam, and this optimizer will be great.
If you take SOAP and set all of its betas to 0, it still works well, so SOAP is already that optimizer.
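Rough sketch of why (my reading, and a single-gradient toy rather than SOAP's actual code: no EMAs, no step size, no weight decay): with all betas at 0, rotating the gradient into the eigenbasis of its own Shampoo-style factors and normalizing elementwise collapses to the same U Vᵀ orthogonalization Muon computes.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))            # one toy (square, generically full-rank) gradient
eps = 1e-8

# Betas = 0: the Shampoo-style factors are just this gradient's outer products
# (in real SOAP they are EMAs), and Adam's moments are just the rotated gradient.
QL = np.linalg.eigh(G @ G.T)[1]        # eigenbasis of the left factor
QR = np.linalg.eigh(G.T @ G)[1]        # eigenbasis of the right factor

Gp = QL.T @ G @ QR                     # gradient in that eigenbasis
step = Gp / (np.sqrt(Gp**2) + eps)     # elementwise Adam update with beta1 = beta2 = 0
update = QL @ step @ QR.T              # rotate back

# Muon-style orthogonalization of the same gradient: U V^T from the SVD.
U, _, Vt = np.linalg.svd(G)
print(np.allclose(update, U @ Vt, atol=1e-4))   # True (up to the eps above)
```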
The square root comes from PCA/ZCA whitening: it makes the empirical covariance of the gradients the identity, so they become decorrelated, which is exactly what the Hessian does on a quadratic objective, by the way.
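Minimal numpy illustration of that claim (toy data, not tied to any particular optimizer): whiten a batch of correlated gradients by the inverse square root of their empirical covariance, and the covariance of the result becomes the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "gradients": correlated 3-D samples with a non-identity covariance.
A = rng.normal(size=(3, 3))
g = rng.normal(size=(10_000, 3)) @ A.T          # rows are gradient samples

C = g.T @ g / len(g)                            # empirical covariance E[g g^T]

# ZCA whitening matrix C^{-1/2} via the eigendecomposition of C.
w, V = np.linalg.eigh(C)
C_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

g_white = g @ C_inv_sqrt                        # whitened gradients

# Their empirical covariance is (numerically) the identity.
print(np.round(g_white.T @ g_white / len(g), 3))
```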