Pion: Rethinking Muon Beyond LLM Pretraining
Muon orthogonalizes the momentum matrix, pushing every singular value to one. This works beautifully for LLM pretraining, which is essentially supervised next-token classification on text. But what happens when we move along three orthogonal axes: a different modality, a different loss,...
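To make "pushes every singular value to one" concrete, here is a minimal sketch in plain Python. It assumes the classic cubic Newton-Schulz iteration rather than the tuned polynomial that Muon implementations typically use, and the names `orthogonalize`, `momentum`, and `update` are hypothetical, chosen only for illustration.

```python
# Sketch only: classic cubic Newton-Schulz iteration, an assumed stand-in
# for whatever orthogonalization routine a real Muon implementation uses.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(m, steps=30):
    """Approximate U V^T for m = U S V^T, i.e. set every singular value to one."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which is inside the convergence region of the iteration below.
    fro = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 X X^T X maps each singular value s to 1.5 s - 0.5 s^3,
        # whose attracting fixed point is s = 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x

momentum = [[3.0, 1.0], [0.0, 2.0]]  # hypothetical 2x2 momentum matrix
update = orthogonalize(momentum)

# If every singular value of `update` is one, update^T update is the identity.
gram = matmul(transpose(update), update)
```

The singular vectors of the momentum are preserved while the magnitudes are discarded, so every direction in the update gets equal weight regardless of how strongly the gradient favored it.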
#optimizer #muon #vla #rlvr #spectral