Pion: Rethinking Muon Beyond LLM Pretraining

A spectral high pass replacement for Muon's Newton Schulz iteration, designed for the regimes where uniform spectral whitening fails.

Chongyu Fan1, Gaowen Liu2, Mingyi Hong3, Ramana Rao Kompella2, Sijia Liu1,4
1Michigan State University·2Cisco·3University of Minnesota·4IBM Research
Preprint, 2026
Pion is a drop in replacement for Muon's NS iteration, at the same per step cost.
Teaser figure
Pion replaces Muon's uniform spectral whitening (every singular value pushed to one) with a sharp spectral high pass: leading singular values are anchored at one while the noisy tail is contracted toward zero.

TL;DR

  • Muon is the de facto matrix aware optimizer for LLM pretraining, i.e., supervised next token classification on text. Muon orthogonalizes the momentum matrix through a few Newton Schulz (NS) iterations, pushing every singular value of the momentum to one.
  • When we move beyond LLM pretraining along three axes (a different modality, a different loss, or a different learning paradigm), Muon’s uniform spectral whitening turns out to be the wrong inductive bias.
  • We propose Pion (sPectral hIgh pass Optimization on momeNtum), a drop in replacement for Muon’s NS iteration. It changes only the polynomial coefficients used inside NS, keeps the same per step cost, and realizes a sharp spectral high pass that anchors the informative leading singular values at one while suppressing the noisy tail toward zero.

Where Muon Lives Today: LLM Pretraining

For a weight matrix \(\boldsymbol{\Theta} \in \mathbb{R}^{m \times n}\), given a stochastic gradient \(\mathbf{G}_t\) and a momentum buffer \(\mathbf{M}_t = \mu \mathbf{M}_{t-1} + \mathbf{G}_t\), Muon performs the steepest descent step under the spectral norm:

\[\boldsymbol{\Theta}_t = \boldsymbol{\Theta}_{t-1} - \eta \, \mathrm{msign}(\mathbf{M}_t).\]

If \(\mathbf{M} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top\) is the compact SVD, then

\[\mathrm{msign}(\mathbf{M}) = \mathbf{U}\, \mathrm{sign}(\boldsymbol{\Sigma})\, \mathbf{V}^\top = \mathbf{U} \mathbf{V}^\top.\]

Every nonzero singular value is mapped to one. This is what we call uniform spectral whitening.
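
In code, the idealized update is just an SVD based matrix sign followed by a momentum step. A minimal PyTorch sketch (the learning rate and momentum values are illustrative placeholders, not the paper's settings):

```python
import torch

def msign(M: torch.Tensor) -> torch.Tensor:
    """Exact matrix sign via the compact SVD: U sign(Sigma) V^T = U V^T."""
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

def muon_step(theta: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
              lr: float = 1e-2, mu: float = 0.95):
    """One idealized Muon update on a single weight matrix."""
    momentum = mu * momentum + grad           # M_t = mu * M_{t-1} + G_t
    theta = theta - lr * msign(momentum)      # Theta_t = Theta_{t-1} - eta * msign(M_t)
    return theta, momentum
```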

Computing an SVD per step is too expensive at scale, so Muon approximates \(\mathrm{msign}\) by a small number of Newton Schulz (NS) iterations. After normalizing the input as \(\mathbf{X} \leftarrow \mathbf{X} / (\|\mathbf{X}\|_F + \epsilon)\), each NS step applies an odd quintic matrix polynomial,

\[\mathbf{X} \leftarrow a\, \mathbf{X} + b\, \mathbf{X}\mathbf{X}^\top \mathbf{X} + c\, \mathbf{X}(\mathbf{X}^\top \mathbf{X})^2,\]

with the canonical coefficients \((a, b, c) = (3.4445,\ -4.7750,\ 2.0315)\). By the identity \(\mathbf{X}(\mathbf{X}^\top \mathbf{X})^j = \mathbf{U}\, \boldsymbol{\Sigma}^{2j+1}\, \mathbf{V}^\top\), an NS step preserves the singular vectors and reshapes each singular value through a scalar polynomial on \([0, 1]\):

\[f(\sigma;\, a, b, c) \,\triangleq\, a\sigma + b\sigma^3 + c\sigma^5.\]

So designing an NS iteration reduces to designing \(f\) on \([0, 1]\). Muon’s NS is constructed so that repeated application drives every \(\sigma \in (0, 1]\) toward one (panel a below).

Muon NS
(a) Muon NS
Promotion
(b) Promotion \(f_{\mathrm{p}}\)
Suppression
(c) Suppression \(f_{\mathrm{s}}\)
High pass NS
(d) Pion high pass NS

Figure 1. Visualization of \(f(\sigma)\) on \(\sigma \in [0, 1]\). Muon (a) drives every singular value toward one. Pion combines Promotion (b) with Suppression (c) to obtain the high pass profile in (d).
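
A minimal PyTorch sketch of the NS loop (our naming; production implementations typically also transpose tall matrices so that the Gram matrix stays small):

```python
import torch

def muon_ns(M: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate msign(M) with Newton Schulz steps using the canonical
    quintic coefficients (a, b, c) = (3.4445, -4.7750, 2.0315)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)                  # Frobenius normalization: singular values in [0, 1]
    for _ in range(steps):
        A = X @ X.T                           # Gram matrix X X^T
        X = a * X + (b * A + c * A @ A) @ X   # a X + b X X^T X + c X (X^T X)^2
    return X
```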

LLM pretraining is a classification task (predict the next token over the vocabulary), uses text as the only modality, and is performed under supervised learning. With clean dense supervision and full rank text gradients, Muon’s uniform whitening is a sensible inductive bias: it spreads optimization signal uniformly across all spectral directions, and consistently improves over AdamW.

Three Axes of Generalization Beyond LLM Pretraining

LLM pretraining occupies one specific corner of the design space. Three orthogonal axes can move us out of that corner.

| Axis | LLM pretraining | Generalization direction | Representative testbed |
| --- | --- | --- | --- |
| Modality | text | vision, robot action | VLA training (vision language action) |
| Loss | classification (next token) | regression, generative (flow matching) | VLA action heads (\(\ell_1\) regression and flow matching) |
| Paradigm | supervised learning | reinforcement learning | RLVR post training (GRPO, GMPO) |

VLA training naturally combines axes 1 and 2: it adds the action modality and replaces classification with regression or generative losses. RLVR isolates axis 3: it keeps the LLM and its tokenizer but switches from supervised next token loss to policy gradient on a verifiable reward.

The rest of the page asks the same question along each axis: does Muon’s uniform spectral whitening still help, or does it become the wrong inductive bias?

Axes 1 and 2: A Different Modality and a Different Loss (VLA)

A VLA model is factorized into a vision encoder, a language backbone, and an action head. The vision and language modules still consume text and image tokens, but the action head introduces a new modality (robot joint actions). The action head also uses non classification losses: either an \(\ell_1\) regression head (e.g., VLA Adapter) or a flow matching generative head (e.g., VLANeXt). These two design choices, the action modality and the regression style loss, are tightly coupled by construction.

We measure the spectral structure of each module’s gradient via the effective rank (erank) of \(\mathbf{G} \in \mathbb{R}^{m \times n}\):

\[\mathrm{erank}(\mathbf{G}) \,\triangleq\, \exp\!\Big( H(\mathbf{p}) \Big), \quad H(\mathbf{p}) = -\sum_{i=1}^n p_i \log p_i, \quad p_i = \frac{\sigma_i(\mathbf{G})}{\sum_j \sigma_j(\mathbf{G})}.\]

A higher erank means gradient energy is spread across many singular directions; a lower erank means it concentrates in a few dominant ones.
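
A direct implementation of this diagnostic (a small \(\epsilon\) guards the logarithm and the normalization):

```python
import torch

def effective_rank(G: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """erank(G) = exp(H(p)), where p is the normalized singular value distribution."""
    s = torch.linalg.svdvals(G)
    p = s / (s.sum() + eps)
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy)
```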

Per module erank along training
Figure 2. Per module gradient erank along the training trajectory of VLA Adapter on LIBERO Object. Vision (top) is highest, language (middle) is intermediate, action (bottom) is consistently the lowest.

The ordering is stable across training: vision is highest, language is intermediate, the action gradient is consistently the lowest. Both the modality switch (rich pixels and tokens become a tiny continuous action vector) and the loss switch (one hot classification becomes regression or generative matching) push the action gradient toward low effective rank.

When Muon is applied uniformly to such a low erank gradient, it lifts the noisy tail directions to the same magnitude as the few informative leading directions. The resulting update is dominated by spectral floor noise.

VLA optimizer comparison
Figure 3. Test success rate on LIBERO Object at 4.5k training steps with vision and language fixed at AdamW. Muon underperforms AdamW on the action head; Low Rank Muon helps but pays roughly 15x training cost.

A natural workaround is Low Rank Muon (LRMuon), which projects the momentum onto a top \(k\) subspace via SVD or Gaussian sketching before NS. LRMuon does close the accuracy gap, but the explicit projection inflates wall clock time by roughly an order of magnitude and forces a fixed rank \(k\) that cannot adapt across layers and training steps.
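
A schematic of the LRMuon projection for the truncated SVD route (`muon_ns` refers to the NS sketch above; the rank \(k\) is a fixed hyperparameter):

```python
import torch

def lrmuon_ns(M: torch.Tensor, k: int, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Project the momentum onto its top-k singular subspace, then run NS on the
    low rank block. The explicit SVD is what inflates wall clock time;
    torch.svd_lowrank (Gaussian sketching) is cheaper but still an extra pass."""
    U, S, Vh = torch.linalg.svd(M, full_matrices=False)
    M_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]   # rank-k approximation of M
    return muon_ns(M_k, steps=steps, eps=eps)        # NS sketch from above
```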

Limitation 1 (modality + loss). Conventional Muon is not adaptive to the rank heterogeneity introduced by new modalities and non classification losses. Explicit low rank projection helps but breaks scalability.

Axis 3: A Different Learning Paradigm (RLVR)

RLVR keeps the LLM and the text modality; it changes the learning paradigm. Instead of a token level supervised loss, the policy is updated by policy gradient against a rule based, verifiable reward. We use GRPO (group relative policy optimization) and GMPO as representative algorithms.

To compare paradigms, we measure the per step gradient signal to noise ratio of a layer’s weight matrix:

\[\mathrm{SNR}(\mathbf{G}) \,\triangleq\, \frac{\|\mathbb{E}[\mathbf{G}]\|_F^2}{\mathbb{E}\big[\,\|\mathbf{G} - \mathbb{E}[\mathbf{G}]\|_F^2\,\big]}.\]

A higher SNR means a cleaner gradient signal.
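
The expectation has to be estimated from samples. A minimal sketch, assuming several independent gradient samples of the same weight matrix are available (e.g., one per microbatch):

```python
import torch

def gradient_snr(grads: list, eps: float = 1e-12) -> torch.Tensor:
    """SNR(G) = ||E[G]||_F^2 / E[||G - E[G]||_F^2], estimated from a list of
    gradient samples for the same weight matrix."""
    G = torch.stack(grads)                             # (num_samples, m, n)
    mean = G.mean(dim=0)                               # empirical E[G]
    signal = mean.pow(2).sum()                         # ||E[G]||_F^2
    noise = (G - mean).pow(2).sum(dim=(1, 2)).mean()   # E[||G - E[G]||_F^2]
    return signal / (noise + eps)
```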

SFT vs GRPO SNR
Figure 4. Gradient SNR of SFT versus GRPO under AdamW on Qwen3 1.7B (MATH levels 3 to 5). GRPO has substantially lower SNR than SFT throughout training.

GRPO has much lower SNR than SFT throughout training. Two structural reasons:

  1. Coarser supervision granularity. SFT receives token level teacher signals, while GRPO uses trajectory level rewards, so each token receives a much sparser learning signal.
  2. Stabilization mechanisms. Importance sampling, clipping, and group relative normalization reweight or zero out parts of the per token gradients, which inflates variance.

When Muon is run on top of these low SNR gradients, the uniform whitening lifts the noisy directions to the same magnitude as the informative ones. The policy collapses within a few steps.

MATH500 accuracy AdamW vs Muon
Figure 5. MATH500 accuracy of GRPO on Qwen3 1.7B with AdamW versus Muon. AdamW improves steadily; Muon collapses to near zero accuracy within a few steps.

A second issue specific to the RLVR paradigm is that Muon’s NS treats each weight matrix as a single block. Attention projections inherit per head specialization from pretraining, so different heads have different Frobenius norms and want updates at different scales. Muon ignores this structure and applies one orthogonalization to the whole projection.

Limitation 2 (paradigm). Muon’s uniform spectral whitening amplifies noisy directions in low SNR RLVR gradients, making it ill suited for noise sensitive post training. It also discards the per head structure inherited from pretraining.

A Unified Spectral Diagnosis

Although Limitations 1 and 2 originate from different sources (low effective rank along the modality and loss axes, low SNR along the paradigm axis), they share one spectral signature. In the SVD of \(\mathbf{M}_t\), the few leading singular values carry the informative descent direction, while the long tail of small singular values is dominated by noise: a spectral floor when erank is low, stochastic estimation noise when SNR is low. Muon’s \(\mathrm{msign}\) lifts the tail to the magnitude of the head and corrupts the update in both regimes.

The natural remedy is a spectral high pass: anchor the informative head near one, contract the noisy tail toward zero. This is exactly what Pion realizes.

Pion: A High Pass Newton Schulz Iteration

Since each NS step reshapes \(\sigma \in [0, 1]\) through the scalar polynomial \(f(\sigma; a, b, c) = a\sigma + b\sigma^3 + c\sigma^5\), designing an NS iteration reduces to designing \(f\). A single such polynomial cannot produce a sharp high pass on the unit interval, so Pion splits the default \(k = 5\) NS steps into two stages with different coefficients:

  • a Promotion polynomial \(f_{\mathrm{p}}\) applied for \(k_{\mathrm{p}}\) steps, which lifts dominant singular values toward one while preserving their relative order;
  • a Suppression polynomial \(f_{\mathrm{s}}\) applied for \(k_{\mathrm{s}} = k - k_{\mathrm{p}}\) steps, which pins large singular values near one and contracts smaller ones toward zero.

The cutoff is controlled by the single hyperparameter \(k_{\mathrm{p}} \in \{0, 1, \ldots, 5\}\).

The Promotion Stage

We require three constraints on \(f_{\mathrm{p}}\):

  1. (P1) Fixed point. \(f_{\mathrm{p}}(1) = 1\), so any direction already at one stays there.
  2. (P2) First order stationarity. \(f_{\mathrm{p}}'(1) = 0\), so the anchor at one is flat.
  3. (P3) Boundary concavity. \(f_{\mathrm{p}}''(1) \leq 0\) together with (P2) ensures that \(\sigma = 1\) is a local maximum, so the iteration does not curve upward past one near the boundary.

Solving (P1) and (P2) leaves a one parameter family. Combining (P3) with monotonicity on \([0, 1]\) carves out the feasible interval \(a_{\mathrm{p}} \in [0, 1.875]\). Since \(f_{\mathrm{p}}'(0) = a_{\mathrm{p}}\) controls how strongly each step lifts small singular values, we pick the largest feasible slope:

\[f_{\mathrm{p}}(\sigma) = 1.875\, \sigma \,-\, 1.25\, \sigma^3 \,+\, 0.375\, \sigma^5.\]

A pleasant byproduct is that the derivative becomes a perfect square,

\[f_{\mathrm{p}}'(\sigma) = 1.875\, (1 - \sigma^2)^2 \,\geq\, 0,\]

so monotonicity on \([0, 1]\) holds automatically.

The Suppression Stage

The suppression polynomial inherits \(f_{\mathrm{s}}(1) = 1\) and \(f_{\mathrm{s}}'(1) = 0\), and adds a spectral filtering condition \(f_{\mathrm{s}}'(0) = 0\). Removing the linear term near the origin forces small singular values to be driven to zero by the higher order terms. The unique solution is

\[f_{\mathrm{s}}(\sigma) = 2.5\, \sigma^3 \,-\, 1.5\, \sigma^5.\]
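
As a sanity check, both polynomials come from the same one parameter family: imposing \(f(1) = 1\) and \(f'(1) = 0\) on \(f(\sigma) = a\sigma + b\sigma^3 + c\sigma^5\) gives

\[a + b + c = 1, \qquad a + 3b + 5c = 0 \;\;\Longrightarrow\;\; b = \tfrac{5}{2} - 2a, \quad c = a - \tfrac{3}{2}.\]

Setting the free slope to its largest feasible value \(a_{\mathrm{p}} = 1.875\) recovers the Promotion coefficients \((1.875, -1.25, 0.375)\), while imposing the filtering condition \(f'(0) = a = 0\) instead yields the Suppression coefficients \((0, 2.5, -1.5)\).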

Putting It Together

Chaining \(k_{\mathrm{p}}\) Promotion steps with \(k_{\mathrm{s}}\) Suppression steps gives Pion’s high pass NS iteration. Fixing \(k = 5\) preserves Muon’s per step cost. Panel (d) of Figure 1 shows the resulting profile: a sharp transition between a pinned region near one and a filtered region near zero, with \(k_{\mathrm{p}}\) controlling the cutoff. Empirically, suppression dominant allocations with \(k_{\mathrm{s}} \geq 3\) work best for both VLA and RLVR.
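
A minimal PyTorch sketch of the two stage iteration (the default \(k_{\mathrm{p}} = 2\) below is an illustrative suppression dominant choice, i.e., \(k_{\mathrm{s}} = 3\)):

```python
import torch

PROMOTION   = (1.875, -1.25, 0.375)   # f_p(s) = 1.875 s - 1.25 s^3 + 0.375 s^5
SUPPRESSION = (0.0,    2.5,  -1.5)    # f_s(s) = 2.5 s^3 - 1.5 s^5

def pion_ns(M: torch.Tensor, k_p: int = 2, k: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """High pass NS: k_p Promotion steps followed by k - k_p Suppression steps.
    Same loop and per step cost as the Muon NS sketch; only (a, b, c) change."""
    X = M / (M.norm() + eps)
    for step in range(k):
        a, b, c = PROMOTION if step < k_p else SUPPRESSION
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```

In this sketch, `k_p=5` applies only Promotion and `k_p=0` only Suppression; everything else matches the Muon NS sketch line for line.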

Takeaway. Pion is a drop in replacement for Muon’s NS iteration. It uses the same control flow, the same per step cost, and changes only the polynomial coefficients.

Per Head Mode for RLVR

Pion has two application modes:

  • Default mode applies the iteration to each weight matrix as a single block (mirrors Muon).
  • Per head mode first reshapes each attention projection along its head dimension into multiple per head sub matrices, then runs the iteration independently on each.

Since the inner products \(\mathbf{X}^\top \mathbf{X}\) are already batched along the head dimension after the reshape, the per head mode adds essentially no cost over the default mode.
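
A schematic of the per head mode (the reshape layout is an assumption; the actual head ordering depends on the model's projection convention):

```python
import torch

PROMOTION   = (1.875, -1.25, 0.375)   # same coefficients as the Pion sketch above
SUPPRESSION = (0.0, 2.5, -1.5)

def per_head_pion_ns(M: torch.Tensor, num_heads: int, k_p: int = 2, k: int = 5,
                     eps: float = 1e-7) -> torch.Tensor:
    """Reshape an (out_dim, in_dim) attention projection into per head blocks,
    run the high pass NS on all heads in one batched pass, then reshape back."""
    out_dim, in_dim = M.shape
    X = M.view(num_heads, out_dim // num_heads, in_dim)        # (heads, head_dim, in_dim)
    X = X / (torch.linalg.matrix_norm(X, keepdim=True) + eps)  # per head Frobenius normalization
    for step in range(k):
        a, b, c = PROMOTION if step < k_p else SUPPRESSION
        A = X @ X.transpose(1, 2)                              # batched Gram matrices, one per head
        X = a * X + (b * A + c * A @ A) @ X
    return X.reshape(out_dim, in_dim)
```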

We use the default mode for VLA (the action head is trained from scratch and has no head structure to preserve) and the per head mode for RLVR. RLVR starts from a pretrained LLM whose attention layers have heterogeneous per head Frobenius norms, and these per head norms govern attention sharpness and gradient magnitudes. Default mode Pion gives an essentially flat update variance across heads, while the per head mode preserves the heterogeneous, layer dependent updates required to keep that pretrained structure.

Per head ablation
Figure 6. MATH500 accuracy of AdamW, Muon (default vs per head), and Pion (default vs per head) on Qwen3 1.7B with GRPO on MATH levels 3 to 5. Per head Pion is the only configuration that beats AdamW. Per head mode does not save Muon, since the lack of noise adaptiveness remains the primary issue.

This isolates two complementary findings:

  1. The spectral high pass is the primary driver of RLVR stability. Per head Muon (the same reshape on top of Muon’s NS) still collapses.
  2. The per head reshape is an auxiliary mechanism that preserves head heterogeneity inherited from pretraining, and it pushes per head Pion past AdamW.

Experiments

Pion is evaluated on two testbeds that span all three axes:

  • VLA training (axes 1 and 2) with two architectures: \(\ell_1\) regression based VLA Adapter and flow matching based VLANeXt, on LIBERO and LIBERO Plus.
  • RLVR post training (axis 3) with GRPO and GMPO on Qwen3 1.7B and Qwen3 4B over MATH and GSM8K.

Three optimizer configurations are compared in each setting: AdamW, Muon, and Pion.

VLA: Pion Outperforms Muon Across Both Heads

LIBERO four tasks
(a) Final success rate on LIBERO
Object training curve
(b) Success rate vs training steps on Object

Figure 7. AdamW, Muon, and Pion for VLA Adapter on LIBERO.

On VLA Adapter / LIBERO Object, Pion reaches 95.4% success at 500 steps and saturates at 100% by 1500 steps. AdamW needs many more steps and Muon stays consistently behind. Pion outperforms Muon on every one of the four LIBERO suites.

For VLANeXt (flow matching), Pion not only wins on the clean LIBERO benchmark but also amplifies its advantage on the harder LIBERO Plus split, especially under language (\(+9\) pts), noise (\(+6\) pts), and robot (\(+6\) pts) perturbations. This suggests that the high pass mechanism produces more robust policies under distribution shift, consistent with the picture that uniform whitening overamplifies non generalizable noise directions.

| Optimizer | LIBERO | LIBERO Plus | Background | Camera | Language | Layout | Light | Noise | Robot |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AdamW | 79.45 | 64.57 | 68.97 | 70.38 | 54.50 | 61.80 | 76.35 | 66.37 | 47.04 |
| Muon | 93.65 | 72.34 | 82.72 | 68.00 | 77.53 | 76.21 | 86.17 | 69.98 | 57.36 |
| Pion (Ours) | **96.35** | **75.93** | **84.53** | **70.88** | **86.93** | **76.71** | **90.67** | **76.09** | **63.18** |

Table 1. AdamW, Muon, and Pion for VLANeXt on LIBERO and LIBERO Plus. Best in bold.

RLVR: Pion Succeeds While Muon Collapses

Across eight RLVR settings (GRPO and GMPO, on Qwen3 1.7B and Qwen3 4B, over MATH and GSM8K), Muon consistently collapses to near zero accuracy. Pion not only recovers a meaningful training signal but also outperforms AdamW with faster convergence.

GRPO MATH 1.7B
(a) GRPO, Qwen3 1.7B, MATH
GRPO GSM8K 1.7B
(b) GRPO, Qwen3 1.7B, GSM8K

Figure 8. Validation accuracy versus training step on RLVR. Muon collapses; Pion outperforms AdamW.

Reverse Ablation: Direction of Spectral Shaping Matters

To verify that the gains come specifically from the high pass direction, we construct Low pass Muon (LPMuon), which mirrors Pion in NS structure and per step cost but flips the polynomial coefficients to induce a low pass mapping (contracts large singular values, amplifies small ones). LPMuon fails to train: its accuracy stays at the level of the initial checkpoint.

Low pass profile
(a) Low pass profile
GSM8K accuracy
(b) GSM8K accuracy

Figure 9. (a) Scalar map \(f(\sigma)\) of LPMuon. (b) GSM8K accuracy of AdamW, Pion, and LPMuon (Qwen3 1.7B, GRPO). Combined with Muon's failure (no filtering) in Figure 8, this isolates the direction of spectral shaping as the key factor.

Conclusion

Muon’s uniform spectral whitening is a great inductive bias for LLM pretraining, where the input is text, the loss is classification, and the paradigm is supervised learning. Once we move along any of these three axes (vision and action modalities, regression and generative losses, or reinforcement learning paradigms), the gradient becomes either low rank or low SNR, and uniform whitening starts amplifying noise rather than informative signal.

Pion replaces uniform whitening with a spectral high pass, realized as a two stage Promotion plus Suppression NS iteration. The cost stays identical to Muon, the only knob is the step allocation between the two stages, and an optional per head reshape preserves the attention head heterogeneity inherited from pretraining. Across VLA training on LIBERO and LIBERO Plus and RLVR post training on Qwen3 1.7B and Qwen3 4B over MATH and GSM8K, Pion consistently beats AdamW and Muon, including settings where Muon collapses to zero.

We see this as evidence that matrix aware optimization beyond LLM pretraining benefits from spectral filtering rather than uniform whitening.

BibTeX

@article{fan2026pion,
  title   = {Pion: Rethinking Muon Beyond Pretraining via a Spectral High Pass Optimizer},
  author  = {Fan, Chongyu and Liu, Gaowen and Hong, Mingyi and Kompella, Ramana Rao and Liu, Sijia},
  journal = {Preprint},
  year    = {2026}
}