Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
Muon orthogonalizes the momentum matrix, pushing every singular value to one. This works beautifully for LLM pretraining, which is essentially supervised next-token classification on text. But what happens when we move along three orthogonal axes: a different modality, a different loss,...
#optimizer #vla #rlvr
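
To make "pushes every singular value to one" concrete, here is a minimal, dependency-free sketch of orthogonalizing a momentum matrix. It uses the classic cubic Newton-Schulz iteration as a simple stand-in; Muon implementations typically use a tuned polynomial and a handful of steps, so treat this as an illustration of the idea, not as Muon's actual update. The function names, the step count, and the example matrix are all assumptions made for this example, and the input is assumed to be full rank.

```python
from math import sqrt


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(m, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix m.

    Illustrative cubic iteration X <- 1.5 X - 0.5 X X^T X. Each singular
    value s is mapped to 1.5 s - 0.5 s^3, which converges to 1 for any
    s in (0, sqrt(3)). The step count is a conservative assumption.
    """
    fro = sqrt(sum(v * v for row in m for v in row))
    if fro == 0.0:
        raise ValueError("cannot orthogonalize the zero matrix")
    # Dividing by the Frobenius norm puts every singular value in (0, 1].
    x = [[v / fro for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rp, rq)] for rp, rq in zip(x, xxtx)]
    return x


# Hypothetical 2x2 "momentum" matrix with two different singular values.
momentum = [[2.0, 1.0], [0.5, 3.0]]
ortho = newton_schulz_orthogonalize(momentum)

# If every singular value is one, then ortho @ ortho^T is the identity.
gram = matmul(ortho, transpose(ortho))
print([[round(v, 6) for v in row] for row in gram])
```

The point to notice is that the result discards all magnitude information in the spectrum: large and small singular directions of the momentum end up with equal weight in the update.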