TL;DR
- Muon is the de facto matrix-aware optimizer for LLM pretraining, which is essentially next-token classification on text via supervised learning. Muon orthogonalizes the momentum matrix through a few Newton-Schulz (NS) iterations, pushing every singular value of the momentum toward one.
- When we move beyond LLM pretraining along three axes (a different modality, a different loss, or a different learning paradigm), Muon’s uniform spectral whitening turns out to be the wrong inductive bias.
- We propose Pion (sPectral hIgh pass Optimization on momeNtum), a drop-in replacement for Muon's NS iteration. It changes only the polynomial coefficients used inside NS, keeps the same per-step cost, and realizes a sharp spectral high-pass filter that anchors the informative leading singular values at one while suppressing the noisy tail toward zero.
Background: Muon and Three Axes Beyond Pretraining
For a weight matrix \(\boldsymbol{\Theta} \in \mathbb{R}^{m \times n}\), given a stochastic gradient \(\mathbf{G}_t\) and a momentum buffer \(\mathbf{M}_t = \mu \mathbf{M}_{t-1} + \mathbf{G}_t\), Muon performs the steepest descent step under the spectral norm:
\[\boldsymbol{\Theta}_t = \boldsymbol{\Theta}_{t-1} - \eta \, \mathrm{msign}(\mathbf{M}_t).\]
If \(\mathbf{M} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top\) is the compact SVD, then
\[\mathrm{msign}(\mathbf{M}) = \mathbf{U}\, \mathrm{sign}(\boldsymbol{\Sigma})\, \mathbf{V}^\top = \mathbf{U} \mathbf{V}^\top.\]
Every nonzero singular value is mapped to one; this is uniform spectral whitening. Computing an SVD per step is too expensive at scale, so Muon approximates \(\mathrm{msign}\) with a small number of Newton-Schulz (NS) iterations. After normalizing the input as \(\mathbf{X} \leftarrow \mathbf{X} / (\|\mathbf{X}\|_F + \epsilon)\), each NS step applies an odd quintic matrix polynomial,
\[\mathbf{X} \leftarrow a\, \mathbf{X} + b\, \mathbf{X}\mathbf{X}^\top \mathbf{X} + c\, \mathbf{X}(\mathbf{X}^\top \mathbf{X})^2,\]
with the canonical coefficients \((a, b, c) = (3.4445,\ -4.7750,\ 2.0315)\). By the identity \(\mathbf{X}(\mathbf{X}^\top \mathbf{X})^j = \mathbf{U}\, \boldsymbol{\Sigma}^{2j+1}\, \mathbf{V}^\top\), an NS step preserves the singular vectors and reshapes each singular value through a scalar polynomial on \([0, 1]\):
\[f(\sigma;\, a, b, c) \,\triangleq\, a\sigma + b\sigma^3 + c\sigma^5.\]
So designing an NS iteration reduces to designing \(f\) on \([0, 1]\). Muon's NS is constructed so that repeated application drives every \(\sigma \in (0, 1]\) toward one. The shape of this scalar map (and Pion's eventual replacement for it) is the central object we visualize in Figure 1.
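The scalar view is easy to check numerically. Below is a minimal numpy sketch (the function name `newton_schulz` and the test matrix are our own illustrative choices, not from any released Muon implementation) that applies the canonical quintic five times to a random matrix and confirms that the output's singular values land in a narrow band around one:

```python
import numpy as np

def newton_schulz(M, coeffs, steps=5, eps=1e-7):
    """Approximate msign(M) by iterating an odd quintic matrix polynomial.

    Each step maps every singular value sigma through
    f(sigma) = a*sigma + b*sigma**3 + c*sigma**5
    while leaving the singular vectors unchanged.
    """
    a, b, c = coeffs
    X = M / (np.linalg.norm(M) + eps)  # Frobenius normalization puts sigma in [0, 1]
    for _ in range(steps):
        A = X @ X.T
        X = a * X + b * (A @ X) + c * (A @ A @ X)
    return X

# Muon's canonical coefficients: repeated application pushes every
# nonzero singular value toward 1 (uniform spectral whitening).
muon_coeffs = (3.4445, -4.7750, 2.0315)

rng = np.random.default_rng(0)
M = rng.standard_normal((64, 32))
O = newton_schulz(M, muon_coeffs)
print(np.linalg.svd(O, compute_uv=False)[:5])  # all values in a narrow band around 1
```

Because the input is Frobenius-normalized, every singular value starts in \([0, 1]\) and each step only moves it along the scalar map \(f\).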
LLM pretraining is a classification task (predict the next token), uses text as its only modality, and runs under supervised learning. With clean, dense supervision and full-rank text gradients, Muon's uniform whitening is a sensible inductive bias. But LLM pretraining occupies just one corner of the design space; three orthogonal axes can move us out of that corner.
| Axis | LLM pretraining | Generalization direction | Representative testbed |
|---|---|---|---|
| Modality | text | vision, robot action | VLA training (vision language action) |
| Loss | classification (next token) | regression, generative (flow matching) | VLA action heads (\(\ell_1\) regression and flow matching) |
| Paradigm | supervised learning | reinforcement learning | RLVR post training (GRPO, GMPO) |
VLA training combines axes 1 and 2 (it adds the action modality and replaces classification with regression or generative losses), and RLVR isolates axis 3 (it keeps the LLM and its tokenizer but switches from the supervised next-token loss to policy gradient on a verifiable reward). The rest of the page asks the same question along each axis: does Muon's uniform spectral whitening still help, or does it become the wrong inductive bias?
Beyond Modality and Loss: VLA
A VLA model factorizes into a vision encoder, a language backbone, and an action head. The vision and language modules still consume image and text tokens, but the action head handles a new modality: continuous robot joint actions. The action head also uses non-classification losses, either an \(\ell_1\) regression head (e.g., VLA Adapter) or a flow-matching generative head (e.g., VLANeXt). These two design choices, the action modality and the regression-style loss, are tightly coupled by construction.
We measure the spectral structure of each module’s gradient via the effective rank (erank) of \(\mathbf{G} \in \mathbb{R}^{m \times n}\):
\[\mathrm{erank}(\mathbf{G}) \,\triangleq\, \exp\!\Big( H(\mathbf{p}) \Big), \quad H(\mathbf{p}) = -\sum_{i=1}^n p_i \log p_i, \quad p_i = \frac{\sigma_i(\mathbf{G})}{\sum_j \sigma_j(\mathbf{G})}.\]
A higher erank means gradient energy is spread across many singular directions; a lower erank means it concentrates in a few dominant ones.
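A minimal numpy implementation of this measure (the helper name `effective_rank` is ours) illustrates the two regimes on synthetic gradients:

```python
import numpy as np

def effective_rank(G, eps=1e-12):
    """Effective rank: exp of the entropy of the normalized singular values."""
    s = np.linalg.svd(G, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > 0]  # 0 * log 0 = 0 by convention
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
# A dense Gaussian gradient spreads energy across many directions...
G_dense = rng.standard_normal((256, 64))
# ...while a rank-2 signal plus a small noise floor concentrates it in a few.
G_lowrank = (rng.standard_normal((256, 2)) @ rng.standard_normal((2, 64))
             + 0.01 * rng.standard_normal((256, 64)))
print(effective_rank(G_dense))    # large: a sizeable fraction of 64
print(effective_rank(G_lowrank))  # small: near the underlying rank of 2
```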



Figure 2. Limitations of Muon in VLA training (VLA Adapter on LIBERO Object). (a) Per module gradient erank along the training trajectory. (b)(c) Test success rate and total training time at 4.5k steps, with vision and language fixed at AdamW; only the action module optimizer differs.
The ordering in panel (a) is stable across training: vision is highest, language is intermediate, and the action gradient is consistently the lowest. Both the modality switch (rich pixels and tokens become a small continuous action vector) and the loss switch (one-hot classification becomes regression or generative matching) push the action gradient toward low effective rank. When Muon is applied uniformly to such a low-erank gradient, it lifts the noisy tail directions to the same magnitude as the few informative leading directions, and the resulting update is dominated by spectral-floor noise. Panel (b) shows Muon underperforming AdamW on the action head. A natural workaround, Low-Rank Muon (LRMuon), projects the momentum onto a top-\(k\) subspace via SVD or Gaussian sketching before NS. LRMuon does close the accuracy gap, but as panel (c) shows, the explicit projection inflates wall-clock time by about an order of magnitude and forces a fixed rank \(k\) that cannot adapt across layers and steps.
Limitation 1 (modality + loss). Conventional Muon does not adapt to the rank heterogeneity introduced by new modalities and non-classification losses. Explicit low-rank projection helps but breaks scalability.
Beyond Paradigm: RLVR
RLVR keeps the LLM and the text modality; what changes is the learning paradigm. Instead of a token-level supervised loss, the policy is updated by policy gradient against a rule-based, verifiable reward. We use GRPO and GMPO as representative algorithms.
To compare paradigms, we measure the per-step gradient signal-to-noise ratio (SNR) of a layer's weight matrix:
\[\mathrm{SNR}(\mathbf{G}) \,\triangleq\, \frac{\|\mathbb{E}[\mathbf{G}]\|_F^2}{\mathbb{E}\big[\,\|\mathbf{G} - \mathbb{E}[\mathbf{G}]\|_F^2\,\big]}.\]
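This quantity can be estimated by Monte Carlo over a batch of per-sample gradients; a small numpy sketch (the helper name `gradient_snr` and the synthetic gradients are ours) shows how noisier supervision depresses the ratio:

```python
import numpy as np

def gradient_snr(grads):
    """Monte Carlo estimate of ||E[G]||_F^2 / E[||G - E[G]||_F^2]
    from a list of per-sample gradient matrices."""
    G = np.stack(grads)                # (num_samples, m, n)
    mean = G.mean(axis=0)
    signal = np.sum(mean ** 2)
    noise = np.mean(np.sum((G - mean) ** 2, axis=(1, 2)))
    return signal / noise

rng = np.random.default_rng(0)
true_G = rng.standard_normal((8, 8))
# Dense token-level supervision: small per-sample noise -> high SNR.
clean = [true_G + 0.1 * rng.standard_normal((8, 8)) for _ in range(64)]
# Sparse trajectory-level reward: large per-sample noise -> low SNR.
noisy = [true_G + 3.0 * rng.standard_normal((8, 8)) for _ in range(64)]
print(gradient_snr(clean) > gradient_snr(noisy))  # True
```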

Figure 3. RLVR diagnosis on Qwen3 1.7B (MATH levels 3 to 5). (a) GRPO has substantially lower gradient SNR than SFT throughout training. (b) Under GRPO, AdamW improves steadily while Muon collapses to near zero accuracy within a few steps.
Two structural reasons explain the SNR gap in panel (a). First, coarser supervision granularity: SFT receives token-level teacher signals, while GRPO uses trajectory-level rewards, so each token gets a much sparser learning signal. Second, stabilization mechanisms: importance sampling, clipping, and group-relative normalization reweight or zero out parts of the per-token gradients, which inflates variance. When Muon is run on top of these low-SNR gradients (panel b), the uniform whitening lifts the noisy directions to the same magnitude as the informative ones, and the policy collapses within a few steps. A second issue specific to the RLVR paradigm is that Muon's NS treats each weight matrix as a single block, ignoring the per-head specialization that pretraining imprints on attention projections.
Limitation 2 (paradigm). Muon's uniform spectral whitening amplifies noisy directions in low-SNR RLVR gradients, making it ill-suited for noise-sensitive post-training. It also discards the per-head structure inherited from pretraining.
Pion: A Spectral High-Pass Optimizer
A unified spectral diagnosis. Although Limitations 1 and 2 originate from different sources (low effective rank along the modality and loss axes, low SNR along the paradigm axis), they share one spectral signature. In the SVD of \(\mathbf{M}_t\), the few leading singular values carry the informative descent direction, while the long tail of small singular values is dominated by noise: a spectral floor when erank is low, stochastic estimation noise when SNR is low. Muon's \(\mathrm{msign}\) lifts the tail to the magnitude of the head and corrupts the update in both regimes. The natural remedy is a spectral high-pass filter: anchor the informative head near one, contract the noisy tail toward zero.
A two-stage high-pass NS iteration. Since each NS step reshapes \(\sigma \in [0, 1]\) through the scalar polynomial \(f(\sigma; a, b, c) = a\sigma + b\sigma^3 + c\sigma^5\), designing an NS iteration reduces to designing \(f\). A single such polynomial cannot produce a sharp high pass on the unit interval, so Pion splits the default \(k = 5\) NS steps into two stages with different coefficients:
- a Promotion polynomial \(f_{\mathrm{p}}\) applied for \(k_{\mathrm{p}}\) steps, which lifts dominant singular values toward one while preserving their relative order;
- a Suppression polynomial \(f_{\mathrm{s}}\) applied for \(k_{\mathrm{s}} = k - k_{\mathrm{p}}\) steps, which pins large singular values near one and contracts smaller ones toward zero.
The cutoff is controlled by the single hyperparameter \(k_{\mathrm{p}} \in \{0, 1, \ldots, 5\}\).
We impose three constraints on \(f_{\mathrm{p}}\): (P1) fixed point \(f_{\mathrm{p}}(1) = 1\); (P2) first-order stationarity \(f_{\mathrm{p}}'(1) = 0\); and (P3) boundary concavity \(f_{\mathrm{p}}''(1) \leq 0\), which together with (P2) ensures \(\sigma = 1\) is a local maximum, so the iteration does not overshoot past one near the boundary. Solving (P1) and (P2) leaves a one-parameter family. Combining (P3) with monotonicity on \([0, 1]\) carves out the feasible interval \(a_{\mathrm{p}} \in [0, 1.875]\). Since \(f_{\mathrm{p}}'(0) = a_{\mathrm{p}}\) controls how strongly each step lifts small singular values, we pick the largest feasible slope, which uniquely determines the polynomial:
\[f_{\mathrm{p}}(\sigma) = 1.875\, \sigma \,-\, 1.25\, \sigma^3 \,+\, 0.375\, \sigma^5.\]
A pleasant byproduct is that the derivative becomes a perfect square, \(f_{\mathrm{p}}'(\sigma) = 1.875\, (1 - \sigma^2)^2 \geq 0\), so monotonicity on \([0, 1]\) holds automatically.
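These closed-form properties are easy to sanity-check; the following sketch (assuming numpy; variable names are ours) verifies (P1), (P2), and the perfect-square form of the derivative on a grid:

```python
import numpy as np

# Promotion polynomial coefficients (a_p, b_p, c_p).
a_p, b_p, c_p = 1.875, -1.25, 0.375

def f_p(s):
    """Promotion polynomial f_p(sigma) = a_p*sigma + b_p*sigma^3 + c_p*sigma^5."""
    return a_p * s + b_p * s**3 + c_p * s**5

def f_p_prime(s):
    """Analytic derivative of f_p."""
    return a_p + 3 * b_p * s**2 + 5 * c_p * s**4

sigma = np.linspace(0.0, 1.0, 1001)
assert np.isclose(f_p(1.0), 1.0)        # (P1) fixed point at sigma = 1
assert np.isclose(f_p_prime(1.0), 0.0)  # (P2) stationarity at the boundary
# Derivative equals the perfect square 1.875 * (1 - sigma^2)^2 >= 0,
# so f_p is monotone on [0, 1].
assert np.allclose(f_p_prime(sigma), 1.875 * (1 - sigma**2) ** 2)
print("promotion polynomial checks pass")
```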
The Suppression polynomial inherits \(f_{\mathrm{s}}(1) = 1\) and \(f_{\mathrm{s}}'(1) = 0\), and adds a spectral filtering condition \(f_{\mathrm{s}}'(0) = 0\). Removing the linear term near the origin forces small singular values to be driven to zero by the higher order terms. The unique solution is
\[f_{\mathrm{s}}(\sigma) = 2.5\, \sigma^3 \,-\, 1.5\, \sigma^5.\]
Chaining \(k_{\mathrm{p}}\) Promotion steps with \(k_{\mathrm{s}}\) Suppression steps gives Pion's high-pass NS iteration. Fixing \(k = 5\) preserves Muon's per-step cost. Panel (d) of Figure 1 below shows the resulting profile: a sharp transition between a pinned region near one and a filtered region near zero, with \(k_{\mathrm{p}}\) controlling the cutoff. Empirically, suppression-dominant allocations with \(k_{\mathrm{s}} \geq 3\) work best for both VLA and RLVR.
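Putting the two stages together, here is a minimal numpy sketch of the resulting iteration (function and constant names are ours, and the synthetic low-erank momentum is purely illustrative): it anchors the two planted directions near one and drives the noise floor toward zero.

```python
import numpy as np

PROMOTE = (1.875, -1.25, 0.375)  # f_p: lift the head toward 1
SUPPRESS = (0.0, 2.5, -1.5)      # f_s: pin the head, filter the tail

def ns_step(X, coeffs):
    """One NS step: X <- a X + b X X^T X + c X (X^T X)^2."""
    a, b, c = coeffs
    A = X @ X.T
    return a * X + b * (A @ X) + c * (A @ A @ X)

def pion(M, k_p=2, k=5, eps=1e-7):
    """Pion's two-stage high-pass NS iteration: k_p Promotion steps
    followed by k - k_p Suppression steps (same cost as Muon's k steps)."""
    X = M / (np.linalg.norm(M) + eps)
    for _ in range(k_p):
        X = ns_step(X, PROMOTE)
    for _ in range(k - k_p):
        X = ns_step(X, SUPPRESS)
    return X

# Synthetic low-erank momentum: two informative directions plus a noise floor.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((64, 64)))
V, _ = np.linalg.qr(rng.standard_normal((32, 32)))
S = np.zeros((64, 32))
S[0, 0], S[1, 1] = 10.0, 8.0
M = U @ S @ V.T + 0.05 * rng.standard_normal((64, 32))

s = np.linalg.svd(pion(M), compute_uv=False)
print(s[:4])  # head anchored near 1, tail suppressed toward 0
```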




Figure 1. Visualization of \(f(\sigma)\) on \(\sigma \in [0, 1]\). Muon (a) drives every singular value toward one. Pion combines Promotion (b) with Suppression (c) to obtain the high pass profile in (d).
Per-head mode for RLVR. Pion has two application modes. The default mode applies the iteration to each weight matrix as a single block, mirroring Muon. The per-head mode first reshapes each attention projection along its head dimension into multiple per-head sub-matrices, then runs the iteration independently on each. Since the inner products \(\mathbf{X}^\top \mathbf{X}\) are batched along the head dimension after the reshape, the per-head mode adds essentially no cost over the default mode. We use the default mode for VLA (the action head is trained from scratch and has no head structure to preserve) and the per-head mode for RLVR: RLVR starts from a pretrained LLM whose attention layers have heterogeneous per-head Frobenius norms, and these per-head norms govern attention sharpness and gradient magnitudes.
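As a sketch of the per-head mode (shapes, head count, and function names are our own illustrative choices, not the released implementation), the reshape turns one matrix iteration into a batched one at no extra cost:

```python
import numpy as np

def ns_step_batched(X, coeffs):
    """One NS step batched over a stack of per-head matrices of shape (H, d, n)."""
    a, b, c = coeffs
    A = X @ np.swapaxes(X, -1, -2)  # (H, d, d), computed per head in one batched matmul
    return a * X + b * (A @ X) + c * (A @ A @ X)

def pion_per_head(M, num_heads, k_p=2, k=5, eps=1e-7):
    """Per-head Pion: reshape (num_heads * head_dim, hidden) into
    (num_heads, head_dim, hidden), normalize each head by its own
    Frobenius norm, run the two-stage iteration per head, reshape back."""
    out_dim, hidden = M.shape
    X = M.reshape(num_heads, out_dim // num_heads, hidden)
    X = X / (np.linalg.norm(X, axis=(-2, -1), keepdims=True) + eps)
    for _ in range(k_p):
        X = ns_step_batched(X, (1.875, -1.25, 0.375))  # Promotion
    for _ in range(k - k_p):
        X = ns_step_batched(X, (0.0, 2.5, -1.5))       # Suppression
    return X.reshape(out_dim, hidden)

rng = np.random.default_rng(0)
W_q_grad = rng.standard_normal((8 * 16, 64))  # 8 heads, head_dim 16, hidden 64
update = pion_per_head(W_q_grad, num_heads=8)
print(update.shape)  # (128, 64)
```

Normalizing per head rather than per matrix is what preserves the heterogeneous per-head scales that a single-block iteration would wash out.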


Figure 4. Effect of per-head high-pass NS on RLVR (Qwen3 1.7B, GRPO on MATH levels 3 to 5). (a) Accuracy of AdamW, Muon (default vs per-head), and Pion (default vs per-head). Per-head Pion is the only configuration that beats AdamW; per-head mode does not rescue Muon, since the lack of noise adaptiveness remains the primary issue. (b) Cross-head variance of the Q projection: before RLVR (top) and after the RLVR update under default vs per-head Pion (bottom).
This isolates two complementary findings: the spectral high pass is the primary driver of RLVR stability (per-head Muon still collapses), and the per-head reshape is an auxiliary mechanism that preserves the head heterogeneity inherited from pretraining and pushes per-head Pion past AdamW.
Takeaway. Pion is a drop-in replacement for Muon's NS iteration: same control flow, same per-step cost; only the polynomial coefficients change.
Experiments
Pion is evaluated on two testbeds that span all three axes:
- VLA training (axes 1 and 2) with two architectures, the \(\ell_1\)-regression-based VLA Adapter and the flow-matching-based VLANeXt, on LIBERO and LIBERO Plus.
- RLVR post training (axis 3) with GRPO and GMPO on Qwen3 1.7B and Qwen3 4B over MATH and GSM8K.
Three optimizer configurations are compared in each setting: AdamW, Muon, and Pion.



Figure 5. AdamW, Muon, and Pion for VLA Adapter on LIBERO. Pion outperforms Muon on every one of the four LIBERO suites, reaches 95.4% success at 500 steps on Object, and saturates at 100% by 1500 steps.
For VLANeXt (flow matching), Pion not only wins on the clean LIBERO benchmark but also widens its advantage on the harder LIBERO Plus split, especially under language (\(+9\) pts), noise (\(+6\) pts), and robot (\(+6\) pts) perturbations. This is consistent with the picture that uniform whitening over-amplifies non-generalizable noise directions.
| Optimizer | LIBERO | LIBERO Plus | Background | Camera | Language | Layout | Light | Noise | Robot |
|---|---|---|---|---|---|---|---|---|---|
| AdamW | 79.45 | 64.57 | 68.97 | 70.38 | 54.50 | 61.80 | 76.35 | 66.37 | 47.04 |
| Muon | 93.65 | 72.34 | 82.72 | 68.00 | 77.53 | 76.21 | 86.17 | 69.98 | 57.36 |
| Pion (Ours) | **96.35** | **75.93** | **84.53** | **70.88** | **86.93** | **76.71** | **90.67** | **76.09** | **63.18** |
Table 1. AdamW, Muon, and Pion for VLANeXt on LIBERO and LIBERO Plus. Best in bold.
Across all eight RLVR settings ({GRPO, GMPO} × {Qwen3 1.7B, Qwen3 4B} × {MATH, GSM8K}), Muon consistently collapses to near-zero accuracy. Pion not only recovers a meaningful training signal but also outperforms AdamW with faster convergence.








Figure 6. AdamW, Muon, and Pion on RLVR: validation accuracy vs training step across eight settings (two algorithms × two model sizes × two benchmarks).
Reverse ablation: the direction of spectral shaping matters. To verify that the gains come specifically from the high-pass direction, we construct Low-pass Muon (LPMuon), which mirrors Pion in NS structure and per-step cost but flips the polynomial coefficients to induce a low-pass mapping (it contracts large singular values and amplifies small ones). LPMuon fails to train: its accuracy stays at the level of the initial checkpoint.


Figure 7. (a) Scalar map \(f(\sigma)\) of LPMuon. (b) GSM8K accuracy of AdamW, Pion, and LPMuon (Qwen3 1.7B, GRPO). Combined with Muon's failure (no filtering) in Figure 6, this isolates the direction of spectral shaping as the key factor.
BibTeX
@article{fan2026pion,
  title   = {Pion: Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR},
  author  = {Fan, Chongyu and Liu, Gaowen and Hong, Mingyi and Kompella, Ramana Rao and Liu, Sijia},
  journal = {Preprint},
  year    = {2026}
}