<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="https://chongyu-fan.netlify.app/feed.xml" rel="self" type="application/atom+xml" /><link href="https://chongyu-fan.netlify.app/" rel="alternate" type="text/html" /><updated>2026-05-23T10:30:21-07:00</updated><id>https://chongyu-fan.netlify.app/feed.xml</id><title type="html">Chongyu Fan</title><subtitle>Ph.D student</subtitle><author><name>Chongyu Fan</name><email>chongyu.fan93@gmail.com</email></author><entry><title type="html">Rethinking Muon Beyond Pretraining: Spectral Failures and High Pass Remedies for VLA and RLVR</title><link href="https://chongyu-fan.netlify.app/posts/pion/" rel="alternate" type="text/html" title="Rethinking Muon Beyond Pretraining: Spectral Failures and High Pass Remedies for VLA and RLVR" /><published>2026-05-15T00:00:00-07:00</published><updated>2026-05-15T00:00:00-07:00</updated><id>https://chongyu-fan.netlify.app/posts/pion</id><content type="html" xml:base="https://chongyu-fan.netlify.app/posts/pion/"><![CDATA[<div class="post-lang post-lang-en">

  <div class="tldr">

    <h2 id="tldr">TL;DR</h2>

    <ul>
      <li><strong>Muon</strong> is the de facto matrix aware optimizer for <strong>LLM pretraining</strong>, which amounts to supervised next token classification on text. Muon approximately orthogonalizes the momentum matrix through a few Newton Schulz (NS) iterations, pushing every nonzero singular value of the momentum toward one.</li>
      <li>When we move beyond LLM pretraining along three axes (a different <strong>modality</strong>, a different <strong>loss</strong>, or a different <strong>learning paradigm</strong>), Muon’s uniform spectral whitening turns out to be the wrong inductive bias.</li>
      <li>We propose <strong>Pion</strong> (s<strong>P</strong>ectral h<strong>I</strong>gh pass <strong>O</strong>ptimization on mome<strong>N</strong>tum), a drop in replacement for Muon’s NS iteration. It changes only the polynomial coefficients used inside NS, keeps the same per step cost, and realizes a sharp spectral high pass that anchors the informative leading singular values at one while suppressing the noisy tail toward zero.</li>
    </ul>

  </div>

  <h2 id="try-it-where-does-a-singular-value-go">Try It: Where Does a Singular Value Go?</h2>

  <div class="pion-demo" data-muon-labels="whitened to 1|lifted from zero|overshoots 1" data-pion-labels="kept (passes the high pass)|filtered (suppressed)|kept (passes the high pass)">
  <div class="pion-demo-body">
    <div class="pion-demo-controls">
      <div class="pion-demo-intro">
        Pick any <span class="kbd">&sigma; &isin; [0, 1]</span> and watch the same <span class="kbd">&sigma;</span> reshaped by Muon's NS iteration (5 steps) and Pion's high pass NS (<span class="kbd">k<sub>p</sub> = 1, k<sub>s</sub> = 4</span>). Muon tries to push every <span class="kbd">&sigma;</span> toward 1; Pion keeps the head, drops the tail.
      </div>
      <div class="pion-demo-slider-row">
        <span class="sigma-label">&sigma;</span>
        <input type="range" min="0" max="1" step="0.001" value="0.35" class="pion-demo-slider" aria-label="input singular value sigma" />
        <span class="pion-demo-sigma">0.350</span>
      </div>
      <div class="pion-demo-outputs">
        <div class="pion-demo-output muon">
          <div class="out-label">Muon (NS)</div>
          <div class="out-result-wrap">
            <div class="out-divider"></div>
            <div class="out-result">
              <span class="out-input">&sigma; = <span class="out-sigma-in">0.350</span></span>
              <span class="out-arrow-sym">&#x21A6;</span>
              <span class="out-value">&mdash;</span>
            </div>
            <div class="out-tag">&nbsp;</div>
          </div>
        </div>
        <div class="pion-demo-output pion">
          <div class="out-label">Pion (High-pass NS)</div>
          <div class="out-result-wrap">
            <div class="out-divider"></div>
            <div class="out-result">
              <span class="out-input">&sigma; = <span class="out-sigma-in">0.350</span></span>
              <span class="out-arrow-sym">&#x21A6;</span>
              <span class="out-value">&mdash;</span>
            </div>
            <div class="out-tag">&nbsp;</div>
          </div>
        </div>
      </div>
    </div>
    <div class="pion-demo-plot-wrap">
      <svg class="pion-demo-plot" xmlns="http://www.w3.org/2000/svg" aria-hidden="true"></svg>
      <div class="pion-demo-legend">
        <span class="lg lg-muon"><span class="swatch"></span>Muon</span>
        <span class="lg lg-pion"><span class="swatch"></span>Pion</span>
      </div>
    </div>
  </div>
</div>

  <h2 id="background">Background</h2>

  <h3 id="muon">Muon</h3>

  <p>Muon is a <strong>matrix aware</strong> optimizer that has gained wide adoption in <strong>LLM pretraining</strong>. Its construction rests on a single observation: once the momentum is treated as a matrix rather than a flat vector, steepest descent under the spectral norm keeps the momentum’s singular vectors and pushes every nonzero singular value to one.</p>

  <p>For a weight matrix \(\boldsymbol{\Theta} \in \mathbb{R}^{m \times n}\), given a stochastic gradient \(\mathbf{G}_t\) and a momentum buffer \(\mathbf{M}_t = \mu \mathbf{M}_{t-1} + \mathbf{G}_t\), Muon performs the steepest descent step under the spectral norm:</p>

\[\boldsymbol{\Theta}_t = \boldsymbol{\Theta}_{t-1} - \eta \, \mathrm{msign}(\mathbf{M}_t).\]

  <p>If \(\mathbf{M} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top\) is the compact SVD, then</p>

\[\mathrm{msign}(\mathbf{M}) = \mathbf{U}\, \mathrm{sign}(\boldsymbol{\Sigma})\, \mathbf{V}^\top = \mathbf{U} \mathbf{V}^\top.\]

  <p>Every nonzero singular value is mapped to one. This is <strong>uniform spectral whitening</strong>. Computing an SVD per step is too expensive at scale, so Muon approximates \(\mathrm{msign}\) by a small number of Newton Schulz (NS) iterations. After normalizing the input as \(\mathbf{X} \leftarrow \mathbf{X} / (\|\mathbf{X}\|_F + \epsilon)\), each NS step applies an odd quintic matrix polynomial,</p>

\[\mathbf{X} \leftarrow a\, \mathbf{X} + b\, \mathbf{X}\mathbf{X}^\top \mathbf{X} + c\, \mathbf{X}(\mathbf{X}^\top \mathbf{X})^2,\]

  <p>with the canonical coefficients \((a, b, c) = (3.4445,\ -4.7750,\ 2.0315)\). By the identity \(\mathbf{X}(\mathbf{X}^\top \mathbf{X})^j = \mathbf{U}\, \boldsymbol{\Sigma}^{2j+1}\, \mathbf{V}^\top\), an NS step preserves the singular vectors and reshapes each singular value through a <strong>scalar polynomial</strong> on \([0, 1]\):</p>

\[f(\sigma;\, a, b, c) \,\triangleq\, a\sigma + b\sigma^3 + c\sigma^5.\]

  <p>So designing an NS iteration reduces to designing \(f\) on \([0, 1]\). Muon’s NS is constructed so that repeated application drives every \(\sigma \in (0, 1]\) toward one. The shape of this scalar map (and Pion’s eventual replacement for it) is the central object we visualize later in <a href="#fig-1">Figure 1</a>.</p>
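
  <p>The reduction from a matrix iteration to a scalar map can be checked numerically. The following is a minimal NumPy sketch, not the reference Muon implementation: the function names <code>ns_step</code>, <code>muon_ns</code>, and <code>scalar_map</code>, the matrix size, and the value of \(\epsilon\) are assumptions made for illustration. It runs the quintic NS step on a random matrix and compares the resulting singular values with the scalar polynomial \(f\) applied to the input singular values.</p>

```python
import numpy as np

# Canonical Muon coefficients quoted in the text above.
MUON_COEFFS = (3.4445, -4.7750, 2.0315)


def ns_step(X, a, b, c):
    """One odd quintic step: a X + b X X^T X + c X (X^T X)^2."""
    A = X.T @ X                      # (n, n) Gram matrix
    return a * X + b * (X @ A) + c * (X @ A @ A)


def muon_ns(M, steps=5, eps=1e-7):
    """Frobenius pre-normalization followed by `steps` NS iterations."""
    X = M / (np.linalg.norm(M) + eps)
    for _ in range(steps):
        X = ns_step(X, *MUON_COEFFS)
    return X


def scalar_map(sigma, steps=5):
    """The same iteration applied to singular values as scalars."""
    a, b, c = MUON_COEFFS
    s = np.asarray(sigma, dtype=float)
    for _ in range(steps):
        s = a * s + b * s**3 + c * s**5
    return s


rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))      # hypothetical small momentum matrix
sigma_in = np.linalg.svd(M / (np.linalg.norm(M) + 1e-7), compute_uv=False)
sigma_out = np.linalg.svd(muon_ns(M), compute_uv=False)

# Both arrays are sorted because the scalar map need not preserve ordering.
print(np.sort(sigma_out))
print(np.sort(scalar_map(sigma_in)))
```

  <p>The two printed arrays should agree up to floating point error, which is the sense in which designing an NS iteration reduces to designing \(f\).</p>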

  <h3 id="three-axes-beyond-pretraining">Three Axes Beyond Pretraining</h3>

  <p>LLM pretraining is typically optimized with a next token prediction loss; more concretely, the task is <strong>classification</strong>, the modality is <strong>text only</strong>, and the paradigm is <strong>supervised learning</strong>. When per token supervision is dense and accurate, pushing every singular value to one is a reasonable default. But LLM pretraining is only one part of deep learning, and how Muon behaves along different <strong>modalities</strong>, <strong>losses</strong>, and <strong>learning paradigms</strong> remains an open question worth exploring.</p>

  <div class="axes-anim">
  <div class="axes-anim-head">
    <div class="axes-anim-title">
      <span class="badge">Three axes</span>
      Three orthogonal steps out of LLM pretraining
    </div>
    <div class="axes-anim-sub">
      LLM pretraining sits in one corner of design space:
      <span class="corner">text</span>
      &nbsp;&plus;&nbsp;
      <span class="corner">classification</span>
      &nbsp;&plus;&nbsp;
      <span class="corner">supervised learning</span>.
      Each axis below pushes us out of that corner in a different direction.
    </div>
  </div>

  <div class="axis-track">
    <div class="axis-tag">
      <span class="axis-num">1</span>
      <div>
        <div class="axis-name">Modality</div>
        <div class="axis-hint">language &nbsp;→&nbsp; vision / action</div>
      </div>
    </div>
    <div class="axis-from">
      <div class="ax-tok-row">
        <span class="ax-tok">the</span><span class="ax-tok">cat</span><span class="ax-tok">sits</span><span class="ax-tok">on</span><span class="ax-tok">mat</span>
      </div>
      <div class="ax-cell-label">text tokens</div>
    </div>
    <div class="axis-arrow"><div class="track"><span class="pulse p1"></span><span class="pulse p2"></span><span class="pulse p3"></span></div><div class="head"></div></div>
    <div class="axis-to">
      <div class="ax-mod-to">
        <div class="ax-mod-pix" aria-label="image patches">
          <span style="background:#a5b4fc"></span><span style="background:#c7d2fe"></span><span style="background:#818cf8"></span>
          <span style="background:#c7d2fe"></span><span style="background:#a5b4fc"></span><span style="background:#818cf8"></span>
          <span style="background:#818cf8"></span><span style="background:#a5b4fc"></span><span style="background:#c7d2fe"></span>
        </div>
        <svg class="ax-mod-arm" viewBox="0 0 48 48" aria-hidden="true" fill="none">
          <ellipse cx="24" cy="45" rx="12" ry="1.2" fill="rgba(15,23,42,0.10)" />
          <rect x="12" y="40.5" width="24" height="4" rx="1.5" fill="#4485C7" />
          <rect x="22" y="31.5" width="4" height="10" fill="#4485C7" />
          <line x1="24" y1="31.5" x2="11" y2="21" stroke="#4485C7" stroke-width="4" stroke-linecap="round" />
          <circle cx="24" cy="31.5" r="3.2" fill="#4485C7" />
          <circle cx="24" cy="31.5" r="1.3" fill="#fff" />
          <line x1="11" y1="21" x2="33" y2="10" stroke="#4485C7" stroke-width="4" stroke-linecap="round" />
          <circle cx="11" cy="21" r="2.8" fill="#4485C7" />
          <circle cx="11" cy="21" r="1.1" fill="#fff" />
          <circle cx="33" cy="10" r="2.3" fill="#4485C7" />
          <path d="M33 10 L39 6 L41 8" stroke="#4485C7" stroke-width="2.2" stroke-linecap="round" stroke-linejoin="round" />
          <path d="M33 10 L39 14 L41 12" stroke="#4485C7" stroke-width="2.2" stroke-linecap="round" stroke-linejoin="round" />
        </svg>
      </div>
      <div class="ax-cell-label">image patches / robot actions</div>
    </div>
  </div>

  <div class="axis-track">
    <div class="axis-tag">
      <span class="axis-num">2</span>
      <div>
        <div class="axis-name">Loss</div>
        <div class="axis-hint">classification &nbsp;→&nbsp; regression</div>
      </div>
    </div>
    <div class="axis-from">
      <svg class="ax-onehot" viewBox="0 0 130 44" aria-hidden="true">
        <line x1="0" y1="20" x2="130" y2="20" stroke="#e5e9ef" stroke-width="0.8" />
        <rect x="6" y="16" width="14" height="4" fill="#93BBE0" rx="1" />
        <rect x="24" y="13" width="14" height="7" fill="#93BBE0" rx="1" />
        <rect x="42" y="10" width="14" height="10" fill="#93BBE0" rx="1" />
        <rect x="60" y="4" width="14" height="16" fill="#4485C7" rx="1" />
        <rect x="78" y="11" width="14" height="9" fill="#93BBE0" rx="1" />
        <rect x="96" y="14" width="14" height="6" fill="#93BBE0" rx="1" />
        <rect x="114" y="16" width="10" height="4" fill="#93BBE0" rx="1" />
        <line x1="0" y1="40" x2="130" y2="40" stroke="#e5e9ef" stroke-width="1" />
        <rect x="6" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="24" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="42" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="60" y="23" width="14" height="17" fill="#4485C7" rx="1" />
        <rect x="78" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="96" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="114" y="38" width="10" height="2" fill="#cbd5e1" rx="1" />
      </svg>
      <div class="ax-cell-label">next-token prediction</div>
    </div>
    <div class="axis-arrow"><div class="track"><span class="pulse p1"></span><span class="pulse p2"></span><span class="pulse p3"></span></div><div class="head"></div></div>
    <div class="axis-to">
      <svg class="ax-curve" viewBox="0 0 130 44" aria-hidden="true">
        <line x1="0" y1="40" x2="130" y2="40" stroke="rgba(68,133,199,0.4)" stroke-width="1" />
        <line x1="8" y1="34" x2="124" y2="9" stroke="#4485C7" stroke-width="1.8" stroke-linecap="round" />
        <circle cx="14" cy="33" r="1.8" fill="#4485C7" />
        <circle cx="28" cy="29" r="1.8" fill="#4485C7" />
        <circle cx="42" cy="28" r="1.8" fill="#4485C7" />
        <circle cx="56" cy="22" r="1.8" fill="#4485C7" />
        <circle cx="70" cy="22" r="1.8" fill="#4485C7" />
        <circle cx="84" cy="16" r="1.8" fill="#4485C7" />
        <circle cx="100" cy="14" r="1.8" fill="#4485C7" />
        <circle cx="116" cy="11" r="1.8" fill="#4485C7" />
      </svg>
      <div class="ax-cell-label">continuous output</div>
    </div>
  </div>

  <div class="axis-track">
    <div class="axis-tag">
      <span class="axis-num">3</span>
      <div>
        <div class="axis-name">Learning paradigm</div>
        <div class="axis-hint">supervised learning &nbsp;→&nbsp; reinforcement learning</div>
      </div>
    </div>
    <div class="axis-from">
      <div class="ax-tape">
        <div class="row">
          <span class="ax-tok">x<sub>1</sub></span><span class="ax-tok">x<sub>2</sub></span><span class="ax-tok">x<sub>3</sub></span><span class="ax-tok">x<sub>4</sub></span><span class="ax-tok">x<sub>5</sub></span>
        </div>
        <div class="row">
          <span class="arr dense">↓</span><span class="arr dense">↓</span><span class="arr dense">↓</span><span class="arr dense">↓</span><span class="arr dense">↓</span>
        </div>
      </div>
      <div class="ax-cell-label">per-token teacher signal</div>
    </div>
    <div class="axis-arrow"><div class="track"><span class="pulse p1"></span><span class="pulse p2"></span><span class="pulse p3"></span></div><div class="head"></div></div>
    <div class="axis-to">
      <div class="ax-tape">
        <div class="row">
          <span class="ax-tok">x<sub>1</sub></span><span class="ax-tok">x<sub>2</sub></span><span class="ax-tok">x<sub>3</sub></span><span class="ax-tok">x<sub>4</sub></span><span class="ax-tok">x<sub>5</sub></span>
        </div>
        <div class="row">
          <span class="arr dot">·</span><span class="arr dot">·</span><span class="arr dot">·</span><span class="arr dot">·</span><span class="arr reward">↓</span>
        </div>
      </div>
      <div class="ax-cell-label">trajectory-level reward</div>
    </div>
  </div>
</div>

  <h2 id="motivation">Motivation</h2>

  <p>We consider two settings: VLA and RLVR. VLA simultaneously changes the modality and the loss (introducing the vision and action modalities and replacing classification with regression loss); RLVR changes only the learning paradigm (replacing supervised learning with reinforcement learning). Through gradient rank and signal-to-noise ratio analyses, we find that Muon transfers poorly to both the action modality and reinforcement learning.</p>

  <h3 id="beyond-modality-and-loss-vla">Beyond Modality and Loss: VLA</h3>

  <p>A VLA model is factorized into a <strong>vision encoder</strong>, a <strong>language backbone</strong>, and an <strong>action head</strong>. The vision and language modules take images and text instructions as input, while the <strong>action head</strong> introduces a new modality: its output is the robot’s actions. Correspondingly, the action head is trained with a non-classification loss: either \(\ell_1\) regression (e.g., VLA Adapter) or flow matching (e.g., VLANeXt). The “action modality” and “regression-style loss” choices are therefore tightly coupled by construction.</p>

  <p>We measure the spectral structure of each module’s gradient via the <strong>effective rank</strong> (erank) of \(\mathbf{G} \in \mathbb{R}^{m \times n}\):</p>

\[\mathrm{erank}(\mathbf{G}) \,\triangleq\, \exp\!\Big( H(\mathbf{p}) \Big), \quad H(\mathbf{p}) = -\sum_{i=1}^{\min(m,n)} p_i \log p_i, \quad p_i = \frac{\sigma_i(\mathbf{G})}{\sum_j \sigma_j(\mathbf{G})}.\]

  <p>A higher erank means gradient energy is spread across many singular directions; a lower erank means it concentrates in a few dominant ones.</p>
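
  <p>The definition translates directly into a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name <code>erank</code> and the small constant <code>eps</code>, used to skip numerically zero singular values so that \(0 \log 0\) is treated as zero, are illustrative choices and not taken from the experiments above.</p>

```python
import numpy as np


def erank(G, eps=1e-12):
    """Effective rank: exp of the Shannon entropy of normalized singular values."""
    s = np.linalg.svd(G, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                   # drop numerically zero entries (0 * log 0 = 0)
    return float(np.exp(-(p * np.log(p)).sum()))


flat = np.eye(8)                                   # all singular values equal
rank_one = np.outer(np.arange(1, 9), np.ones(5))   # a single nonzero singular value
print(erank(flat), erank(rank_one))
```

  <p>A flat spectrum gives an erank equal to the number of singular values, while a rank one matrix gives an erank of one, matching the two extremes described above.</p>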

  <div id="fig-2" class="fig-grid" style="grid-template-columns: repeat(3, 1fr);">
  <figure><img src="/images/blog/pion/erank_heatmap_sampled.png" alt="erank" /><figcaption>(a) Per module gradient erank</figcaption></figure>
  <figure><img src="/images/blog/pion/success_rate.png" alt="success rate" /><figcaption>(b) Test success rate</figcaption></figure>
  <figure><img src="/images/blog/pion/total_training_time.png" alt="training time" /><figcaption>(c) Total training time (hrs)</figcaption></figure>
</div>
  <p class="fig-caption">Figure 2. Limitations of Muon in VLA training (VLA Adapter on LIBERO Object). (a) Per module gradient erank along the training trajectory. (b)(c) Test success rate and total training time at 4.5k steps, with vision and language fixed at AdamW; only the action module optimizer differs.</p>

  <p>The ordering in <a href="#fig-2">Figure 2</a>(a) is stable across training: vision has the highest erank, language sits in the middle, and the <strong>action gradient is consistently the lowest</strong>. The intuition is twofold. First, vision and text inputs carry far richer information per sample, whereas an action vector is expressed with only 7 degrees of freedom. Second, the action head is trained with a regression loss, whose output space is much smaller than the discrete-token space of language and vision, so its gradient is strongly low-rank in nature. When Muon is applied uniformly to such a low-erank gradient, it lifts the weak noise tail to the same magnitude as the few informative leading directions, and the resulting update is dominated by spectral floor noise. Consequently, <a href="#fig-2">Figure 2</a>(b) shows Muon underperforming AdamW on the action head. A natural workaround, <strong>Low-Rank Muon (LRMuon)</strong>, projects the momentum onto a top \(k\) subspace via SVD or Gaussian sketching before NS. LRMuon recovers the success rate, but <a href="#fig-2">Figure 2</a>(c) shows that the explicit projection inflates wall-clock time by about an order of magnitude and forces a fixed rank \(k\) that cannot adapt across layers and steps.</p>
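
  <p>The tail lifting effect can be reproduced on synthetic data. The sketch below is only an illustration under assumptions: the matrix size, the rank, the noise level, and the helper names <code>msign</code> and <code>head_energy_fraction</code> are hypothetical, and the exact SVD-based \(\mathrm{msign}\) stands in for the NS approximation. It builds a rank \(r\) signal plus weak full rank noise and measures how much of the update energy lies in the signal subspace before and after whitening.</p>

```python
import numpy as np


def msign(M):
    """Exact matrix sign via the compact SVD: U V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt


def head_energy_fraction(X, U_r, V_r):
    """Share of the squared Frobenius energy of X inside span(U_r) x span(V_r)."""
    core = U_r.T @ X @ V_r
    return float(np.linalg.norm(core) ** 2 / np.linalg.norm(X) ** 2)


rng = np.random.default_rng(0)
n, r = 64, 4                                   # hypothetical sizes
U_r, _ = np.linalg.qr(rng.standard_normal((n, r)))
V_r, _ = np.linalg.qr(rng.standard_normal((n, r)))
signal = U_r @ V_r.T                           # rank r, nonzero singular values all 1
noise = 1e-3 * rng.standard_normal((n, n))     # weak full rank tail
G = signal + noise

before = head_energy_fraction(G, U_r, V_r)
after = head_energy_fraction(msign(G), U_r, V_r)
print(before, after)
```

  <p>Before whitening nearly all of the energy sits in the \(r\) informative directions. After whitening every direction has unit singular value, so the informative share falls to roughly \(r / n\) and the remainder of the update is lifted noise.</p>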

  <div class="callout">
    <p><strong>Limitation 1 (modality + loss).</strong> Conventional Muon does not adapt to the rank heterogeneity introduced by new modalities and non-classification losses. Explicit low-rank projection recovers the success rate but at the cost of scalability.</p>
  </div>

  <h3 id="beyond-learning-paradigm-rlvr">Beyond Learning Paradigm: RLVR</h3>

  <p>RLVR keeps the LLM and the text modality; only the <strong>learning paradigm</strong> changes: a token-level supervised loss (as in SFT) is replaced by a trajectory-level policy gradient against a verifiable reward (as in GRPO). To compare the two paradigms on the same footing, we measure the per-step gradient signal-to-noise ratio on a given layer’s weight matrix:</p>

\[\mathrm{SNR}(\mathbf{G}) \,\triangleq\, \frac{\|\mathbb{E}[\mathbf{G}]\|_F^2}{\mathbb{E}\big[\,\|\mathbf{G} - \mathbb{E}[\mathbf{G}]\|_F^2\,\big]}.\]
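
  <p>In practice the expectations have to be replaced by sample averages. The following is a minimal plug-in estimator, assuming access to a stack of independent gradient samples for one weight matrix; how those samples are collected is not specified here, and the function name <code>gradient_snr</code> and the synthetic data are illustrative. Note that the plug-in numerator is biased upward when the number of samples is small.</p>

```python
import numpy as np


def gradient_snr(grads):
    """Plug-in SNR estimate from a stack of gradient samples, shape (B, m, n)."""
    grads = np.asarray(grads, dtype=float)
    mean = grads.mean(axis=0)                        # estimate of E[G]
    signal = np.sum(mean ** 2)                       # squared Frobenius norm of the mean
    noise = np.mean(np.sum((grads - mean) ** 2, axis=(1, 2)))
    return float(signal / noise)


rng = np.random.default_rng(0)
mean_grad = rng.standard_normal((8, 4))              # hypothetical true gradient
clean = mean_grad + 0.1 * rng.standard_normal((256, 8, 4))
noisy = mean_grad + 10.0 * rng.standard_normal((256, 8, 4))
print(gradient_snr(clean), gradient_snr(noisy))
```

  <p>The same mean gradient yields a high SNR under small per-sample noise and a low SNR under large per-sample noise, which is the quantity compared across SFT and GRPO.</p>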

  <p><a href="#fig-3">Figure 3</a>(a) shows that GRPO gradients have substantially lower SNR than SFT gradients throughout training. Two structural reasons explain this gap. First, <strong>coarser supervision granularity</strong>: SFT uses token-level teacher signals, while GRPO uses trajectory-level rewards, so each token receives a much sparser learning signal. Second, <strong>stabilization mechanisms</strong>: importance sampling, clipping, and group-relative normalization reweight or zero out parts of the per-token gradients, further inflating variance. When Muon is applied on top of these low-SNR gradients, the uniform whitening lifts the noisy directions to the same magnitude as the informative ones, and the policy collapses within a few steps, as shown in <a href="#fig-3">Figure 3</a>(b).</p>

  <div id="fig-3" class="fig-grid wrap" style="grid-template-columns: repeat(2, 1fr);">
  <figure><img src="/images/blog/pion/grad_snr_sft_vs_rl_math3-5_step80.png" alt="SFT vs GRPO SNR" /><figcaption>(a) Gradient SNR: SFT vs GRPO</figcaption></figure>
  <figure><img src="/images/blog/pion/acc_adamw_muon.png" alt="Accuracy AdamW vs Muon" /><figcaption>(b) MATH500: AdamW vs Muon</figcaption></figure>
</div>
  <p class="fig-caption">Figure 3. RLVR diagnosis on Qwen3 1.7B (MATH levels 3 to 5). (a) GRPO has substantially lower gradient SNR than SFT throughout training. (b) Under GRPO, AdamW improves steadily while Muon collapses to near zero accuracy within a few steps.</p>

  <div class="callout">
    <p><strong>Limitation 2 (learning paradigm).</strong> Muon’s uniform spectral whitening amplifies noisy directions in low-SNR RLVR gradients, making it unsuitable for noise-sensitive post-training.</p>
  </div>

  <h2 id="method">Method</h2>

  <p>Limitations 1 and 2 come from different sources (low effective rank along the modality / loss axes, low SNR along the learning paradigm axis), yet they share <strong>one spectral signature</strong>. In the SVD of \(\mathbf{M}_t\), the few <strong>leading</strong> singular values carry the informative descent direction, while the long <strong>tail</strong> of small singular values is dominated by noise: a spectral floor when erank is low, stochastic estimation noise when SNR is low. Muon’s \(\mathrm{msign}\) lifts the tail to the magnitude of the head and corrupts the update in both regimes. The natural remedy is a <strong>spectral high pass</strong>: anchor the informative head near one and contract the noisy tail toward zero.</p>

  <h3 id="high-pass-ns">High-Pass NS</h3>

  <p>Since each NS step reshapes \(\sigma \in [0, 1]\) through the scalar polynomial \(f(\sigma; a, b, c) = a\sigma + b\sigma^3 + c\sigma^5\), designing an NS iteration reduces to designing \(f\). A single such polynomial cannot produce a sharp high pass on the unit interval, so Pion splits the default \(k = 5\) NS steps into two stages with different coefficients:</p>

  <ul>
    <li>a <strong>Promotion</strong> polynomial \(f_{\mathrm{p}}\) applied for \(k_{\mathrm{p}}\) steps, which lifts dominant singular values toward one while preserving their relative order;</li>
    <li>a <strong>Suppression</strong> polynomial \(f_{\mathrm{s}}\) applied for \(k_{\mathrm{s}} = k - k_{\mathrm{p}}\) steps, which pins large singular values near one and contracts smaller ones toward zero.</li>
  </ul>

  <p>The cutoff is controlled by the single hyperparameter \(k_{\mathrm{p}} \in \{0, 1, \ldots, 5\}\).</p>

  <p>We require three constraints on \(f_{\mathrm{p}}\): <strong>(P1)</strong> fixed point \(f_{\mathrm{p}}(1) = 1\); <strong>(P2)</strong> first order stationarity \(f_{\mathrm{p}}'(1) = 0\); and <strong>(P3)</strong> boundary concavity \(f_{\mathrm{p}}''(1) \leq 0\), which together with (P2) ensures \(\sigma = 1\) is a maximum so that the iteration does not curve upward past one near the boundary. Solving (P1) and (P2) leaves a one parameter family. Combining (P3) with monotonicity on \([0, 1]\) carves out the feasible interval \(a_{\mathrm{p}} \in [0, 1.875]\). Since \(f_{\mathrm{p}}'(0) = a_{\mathrm{p}}\) controls how strongly each step lifts small singular values, we pick the largest feasible slope, which uniquely determines the polynomial:</p>

\[f_{\mathrm{p}}(\sigma) = 1.875\, \sigma \,-\, 1.25\, \sigma^3 \,+\, 0.375\, \sigma^5.\]

  <p>A pleasant byproduct is that the derivative becomes a perfect square, \(f_{\mathrm{p}}'(\sigma) = 1.875\, (1 - \sigma^2)^2 \geq 0\), so monotonicity on \([0, 1]\) holds automatically.</p>
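
  <p>As a worked check, write \(f_{\mathrm{p}}(\sigma) = a_{\mathrm{p}} \sigma + b_{\mathrm{p}} \sigma^3 + c_{\mathrm{p}} \sigma^5\), where \(b_{\mathrm{p}}\) and \(c_{\mathrm{p}}\) are coefficient names introduced here for illustration. Conditions (P1) and (P2) give</p>

\[a_{\mathrm{p}} + b_{\mathrm{p}} + c_{\mathrm{p}} = 1, \quad a_{\mathrm{p}} + 3 b_{\mathrm{p}} + 5 c_{\mathrm{p}} = 0 \;\Longrightarrow\; b_{\mathrm{p}} = \tfrac{5}{2} - 2 a_{\mathrm{p}}, \quad c_{\mathrm{p}} = a_{\mathrm{p}} - \tfrac{3}{2},\]

  <p>so that \(f_{\mathrm{p}}''(1) = 6 b_{\mathrm{p}} + 20 c_{\mathrm{p}} = 8 a_{\mathrm{p}} - 15\). Condition (P3) therefore holds exactly when \(a_{\mathrm{p}} \leq 15/8 = 1.875\), and the boundary choice \(a_{\mathrm{p}} = 1.875\) yields \(b_{\mathrm{p}} = -1.25\) and \(c_{\mathrm{p}} = 0.375\), with \(f_{\mathrm{p}}''(1) = 0\).</p>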

  <p>The Suppression polynomial inherits \(f_{\mathrm{s}}(1) = 1\) and \(f_{\mathrm{s}}'(1) = 0\), and adds a <strong>spectral filtering</strong> condition \(f_{\mathrm{s}}'(0) = 0\). This removes the linear term, so near the origin \(f_{\mathrm{s}}(\sigma)\) is governed by the cubic term and small singular values contract toward zero at a cubic rate. The unique solution is</p>

\[f_{\mathrm{s}}(\sigma) = 2.5\, \sigma^3 \,-\, 1.5\, \sigma^5.\]
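
  <p>As a worked check, with \(b_{\mathrm{s}}\) and \(c_{\mathrm{s}}\) as coefficient names introduced here for illustration: the condition \(f_{\mathrm{s}}'(0) = 0\) removes the linear term, and the two remaining conditions give \(b_{\mathrm{s}} + c_{\mathrm{s}} = 1\) and \(3 b_{\mathrm{s}} + 5 c_{\mathrm{s}} = 0\), hence \(b_{\mathrm{s}} = 2.5\) and \(c_{\mathrm{s}} = -1.5\). The fixed points of the resulting map follow from the factorization</p>

\[f_{\mathrm{s}}(\sigma) - \sigma = \sigma\, (1 - \sigma^2)\, \big(\tfrac{3}{2} \sigma^2 - 1\big),\]

  <p>so on \((0, 1)\) the only interior fixed point is \(\sigma = \sqrt{2/3} \approx 0.816\). Below it \(f_{\mathrm{s}}(\sigma) &lt; \sigma\) and repeated Suppression steps contract toward zero; above it \(f_{\mathrm{s}}(\sigma) &gt; \sigma\) and they push toward one. Since \(f_{\mathrm{s}}'(\sigma) = 7.5\, \sigma^2 (1 - \sigma^2) \geq 0\) on \([0, 1]\), the map is monotone there and the iteration stays inside the unit interval.</p>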

  <p>Chaining \(k_{\mathrm{p}}\) Promotion steps with \(k_{\mathrm{s}}\) Suppression steps gives Pion’s <strong>high-pass NS</strong> iteration. Fixing \(k = 5\) preserves Muon’s per-step cost. <a href="#fig-1">Figure 1</a> compares Muon NS, Promotion, Suppression, and the resulting Pion high-pass NS profile: a sharp transition between a pinned region near one and a filtered region near zero, with \(k_{\mathrm{p}}\) controlling the cutoff. Empirically, <strong>suppression-dominant</strong> allocations with \(k_{\mathrm{s}} \geq 3\) work best for both VLA and RLVR.</p>

  <div id="fig-1" class="fig-grid" style="grid-template-columns: repeat(4, 1fr);">
  <figure><img src="/images/blog/pion/iter_NS.png" alt="Muon NS" /><figcaption>(a) Muon NS</figcaption></figure>
  <figure><img src="/images/blog/pion/iter_P.png" alt="Promotion" /><figcaption>(b) Promotion \(f_{\mathrm{p}}\)</figcaption></figure>
  <figure><img src="/images/blog/pion/iter_S.png" alt="Suppression" /><figcaption>(c) Suppression \(f_{\mathrm{s}}\)</figcaption></figure>
  <figure><img src="/images/blog/pion/iter_pion_mix.png" alt="High pass NS" /><figcaption>(d) Pion high pass NS</figcaption></figure>
</div>
  <p class="fig-caption">Figure 1. Visualization of \(f(\sigma)\) on \(\sigma \in [0, 1]\). Muon (a) drives every singular value toward one. Pion combines Promotion (b) with Suppression (c) to obtain the high pass profile in (d).</p>
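
  <p>The scalar profiles in Figure 1 can be approximated with a short script. This is a minimal sketch: the polynomials are the ones given above, while the function names and the definition of the cutoff as the smallest \(\sigma\) whose output reaches 0.5 are assumptions made for illustration.</p>

```python
import numpy as np


def f_p(s):
    """Promotion polynomial from the text."""
    return 1.875 * s - 1.25 * s**3 + 0.375 * s**5


def f_s(s):
    """Suppression polynomial from the text."""
    return 2.5 * s**3 - 1.5 * s**5


def pion_profile(sigma, k_p, k=5):
    """Apply k_p Promotion steps, then k - k_p Suppression steps, to scalars."""
    s = np.asarray(sigma, dtype=float)
    for _ in range(k_p):
        s = f_p(s)
    for _ in range(k - k_p):
        s = f_s(s)
    return s


def cutoff(k_p, k=5, grid=10001):
    """Smallest grid value of sigma whose output reaches 0.5 (illustrative cutoff)."""
    sigma = np.linspace(0.0, 1.0, grid)
    out = pion_profile(sigma, k_p, k)
    return float(sigma[np.argmax(out >= 0.5)])


for k_p in range(6):
    print(k_p, round(cutoff(k_p), 3))
```

  <p>Each additional Promotion step moves the cutoff toward smaller \(\sigma\), which is how \(k_{\mathrm{p}}\) trades off between passing more of the spectrum and filtering more of the tail.</p>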

  <h3 id="per-head-mode-for-rlvr">Per-Head Mode for RLVR</h3>

  <p>So far the high-pass NS has been applied to each per-layer momentum \(\mathbf{M}_t \in \mathbb{R}^{m \times n}\) as a single block, exactly mirroring Muon; we call this the <strong>default mode</strong>. We find that this does not transfer well to RLVR. RLVR starts from a model that has already been pretrained (or SFT’d), whose attention layers exhibit substantial per-head heterogeneity in \(\|\mathbf{W}_Q^h\|_F\), \(\|\mathbf{W}_K^h\|_F\), \(\|\mathbf{W}_V^h\|_F\), and \(\|\mathbf{W}_O^h\|_F\). This heterogeneity jointly governs the forward outputs and the backward gradients, so different heads should naturally receive updates at different scales.</p>

  <p>To respect this structure, Pion adds a <strong>per-head mode</strong> that first reshapes each attention projection along the head dimension into per-head sub-matrices and then runs the two-stage high-pass NS independently on each. Formally, with \(H\) attention heads and per-head dimension \(d_k\), each attention projection (Q / K / V / O) admits a canonical reshape along the head axis,</p>

\[\mathbf{M}_t \;\xrightarrow{\;\mathrm{Reshape}\;}\; \{\mathbf{M}_t^h\}_{h=1}^{H}, \qquad \mathbf{M}_t^h \in \mathbb{R}^{d \times d_k}.\]

  <p>The per-head mode then applies the full two-stage high-pass NS <strong>independently on each \(\mathbf{M}_t^h\)</strong>: a per-head Frobenius pre-normalization \(\mathbf{X}^h \leftarrow \mathbf{M}_t^h / (\|\mathbf{M}_t^h\|_F + \epsilon)\), followed by \(k_{\mathrm{p}}\) Promotion steps and \(k_{\mathrm{s}}\) Suppression steps, and finally a reshape of \(\{\mathbf{X}^h\}_{h=1}^H\) back to a single \(\mathbf{X} \in \mathbb{R}^{m \times n}\). Because \(\mathbf{X}^h (\mathbf{X}^h)^\top \mathbf{X}^h\) is naturally batched over \(h\) on GPU, the only extra cost over the default mode is the reshape itself.</p>
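
  <p>The per-head procedure can be sketched with batched matrix products. The following is a minimal NumPy illustration, not the authors' implementation. It assumes that the heads are laid out contiguously along the column axis of the projection, which is a layout assumption (an output projection would typically need a transpose first), and the function names and sizes are hypothetical.</p>

```python
import numpy as np

PROMOTION = (1.875, -1.25, 0.375)     # f_p coefficients from the text
SUPPRESSION = (0.0, 2.5, -1.5)        # f_s coefficients from the text


def high_pass_ns(X, k_p, k=5, eps=1e-7):
    """Two-stage high-pass NS on one matrix or a batch of matrices (..., r, c)."""
    norm = np.linalg.norm(X, axis=(-2, -1), keepdims=True)   # Frobenius, per matrix
    X = X / (norm + eps)
    schedule = [PROMOTION] * k_p + [SUPPRESSION] * (k - k_p)
    for a, b, c in schedule:
        A = np.swapaxes(X, -1, -2) @ X                       # batched Gram matrices
        X = a * X + b * (X @ A) + c * (X @ A @ A)
    return X


def per_head_high_pass_ns(M, num_heads, k_p, k=5):
    """Split the columns of M into heads, filter each head independently, merge back."""
    d, n = M.shape
    d_k = n // num_heads
    heads = M.reshape(d, num_heads, d_k).transpose(1, 0, 2)  # (H, d, d_k)
    out = high_pass_ns(heads, k_p, k)                        # batched over H
    return out.transpose(1, 0, 2).reshape(d, n)


rng = np.random.default_rng(0)
M = rng.standard_normal((32, 4 * 8))       # hypothetical d = 32, H = 4, d_k = 8
X = per_head_high_pass_ns(M, num_heads=4, k_p=1)
print(X.shape)
```

  <p>Because the normalization and the NS chain act on each \(\mathbf{M}_t^h\) separately, one head's result does not depend on the scale or content of the others, and the batched products keep the work in a single call per step.</p>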

  <p><a href="#fig-4">Figure 4</a>(a) makes the default-mode failure concrete on Qwen3 1.7B: the pre-RLVR cross-head variance \(\mathrm{Var}_h(\|\mathbf{W}_{0,Q}^h\|_F)\) is non-negligible across all 28 layers (top), yet under default-mode Pion the update variance \(\mathrm{Var}_h(\|\mathbf{W}_{*,Q}^h - \mathbf{W}_{0,Q}^h\|_F)\) collapses to near zero (bottom). In other words, a single Frobenius pre-normalization plus a single NS chain over the <strong>whole</strong> projection equalizes the update scale across heads and mixes head-specific directions, so every head ends up with an almost identical update and the inter-head heterogeneity is erased. By contrast, the per-head mode restores a layer-dependent, head-specific update profile.</p>

  <div id="fig-4" class="fig-grid wrap" style="grid-template-columns: repeat(2, 1fr);">
  <figure><img src="/images/blog/pion/qwen3_1p7b_Q_headnorm_variance_compact_step80.png" alt="Q headnorm variance" /><figcaption>(a) Cross-head Q variance</figcaption></figure>
  <figure><img src="/images/blog/pion/acc_mh.png" alt="Per head ablation accuracy" /><figcaption>(b) MATH500 accuracy</figcaption></figure>
</div>
  <p class="fig-caption">Figure 4. Effect of per-head high-pass NS on RLVR (Qwen3 1.7B, GRPO on MATH levels 3 to 5). (a) Cross-head Q projection variance: pre-RLVR weight \(\mathrm{Var}_h(\|\mathbf{W}_{0,Q}^h\|_F)\) (top) and post-RLVR update \(\mathrm{Var}_h(\|\mathbf{W}_{*,Q}^h - \mathbf{W}_{0,Q}^h\|_F)\) for default vs. per-head Pion (bottom). (b) MATH500 accuracy of AdamW, Muon (default vs. per-head), and Pion (default vs. per-head).</p>

  <p>At this point, Pion’s per-head high-pass NS bundles two design choices: the <strong>spectral high pass</strong> and the <strong>per-head reshape</strong>. A natural follow-up question is which of the two is doing the heavy lifting. We find that the two are complementary but <strong>not symmetric</strong>. The <strong>spectral high pass</strong> is the <em>primary</em> driver: in <a href="#fig-4">Figure 4</a>(b), even if we apply the same reshape on top of Muon’s NS, the resulting per-head Muon still collapses, because injecting the noise tail head-by-head is just as harmful as injecting it on the whole matrix. The <strong>per-head reshape</strong> is the <em>auxiliary</em> mechanism, used to preserve the per-head heterogeneity inherited from the pretrained (or SFT’d) attention layers. We do not use per-head mode for VLA: there the action head is trained from scratch, with no pretrained multi-head attention structure to preserve.</p>

  <h3 id="algorithms">Algorithms</h3>

  <p>For reference, we write out the full procedures below. Algorithm 1 is Muon’s standard NS iteration. Pion only replaces the inner NS loop with a two stage high pass version: Algorithm 2 is the default mode used for VLA training; Algorithm 3 is the per head mode used for RLVR post training. The total iteration count is fixed to \(k = 5\), split into \(k_{\mathrm{p}}\) Promotion steps and \(k_{\mathrm{s}} = k - k_{\mathrm{p}}\) Suppression steps. <span class="algo-legend-inline"><span class="lg-item"><span class="swatch s-1"></span> Pion vs. Muon</span><span class="lg-item"><span class="swatch s-2"></span> per head specific</span></span></p>

  <div class="algo-row">
  <div class="algo-block compact">
    <div class="algo-header">
      <span class="algo-num">Algorithm 1</span>
      <span class="algo-title">Muon Optimizer</span>
    </div>
    <div class="algo-require">
      <span class="kw">Require:</span>&nbsp; learning rate \(\eta\), momentum coefficient \(\mu\), NS iteration count \(k = 5\)
    </div>
    <ol class="algo-steps">
      <li><span class="line">\(\mathbf{M}_0 \leftarrow \mathbf{0}\)</span></li>
      <li><span class="line"><span class="kw">for</span> \(t = 1, 2, \dots\) <span class="kw">do</span></span></li>
      <li class="ind-1"><span class="line">\(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{X} \leftarrow \mathbf{M}_t / (\lVert \mathbf{M}_t \rVert_F + \epsilon)\)</span><span class="comment">Frobenius pre-norm</span></li>
      <li class="ind-1"><span class="line"><span class="kw">for</span> \(i = 1, \dots, k\) <span class="kw">do</span></span><span class="comment">\((a, b, c) = (3.4445, -4.7750, 2.0315)\)</span></li>
      <li class="ind-2"><span class="line">\(\mathbf{X} \leftarrow a\mathbf{X} + b\mathbf{X}\mathbf{X}^\top\mathbf{X} + c\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)</span></li>
      <li class="ind-1"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1"><span class="line">\(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)</span></li>
      <li><span class="line"><span class="kw">end for</span></span></li>
      <li><span class="line"><span class="kw">return</span> \(\boldsymbol{\Theta}_t\)</span></li>
    </ol>
  </div>

  <div class="algo-block compact">
    <div class="algo-header">
      <span class="algo-num">Algorithm 2</span>
      <span class="algo-title">Pion Optimizer (default: high pass NS on the whole matrix)</span>
    </div>
    <div class="algo-require">
      <span class="kw">Require:</span>&nbsp; learning rate \(\eta\), momentum coefficient \(\mu\), promotion steps \(k_{\mathrm{p}}\)
    </div>
    <ol class="algo-steps">
      <li class="algo-diff"><span class="line">\(k_{\mathrm{s}} \leftarrow 5 - k_{\mathrm{p}}\)</span><span class="comment">split \(k = 5\) into \(k_{\mathrm{p}} + k_{\mathrm{s}}\)</span></li>
      <li><span class="line">\(\mathbf{M}_0 \leftarrow \mathbf{0}\)</span></li>
      <li><span class="line"><span class="kw">for</span> \(t = 1, 2, \dots\) <span class="kw">do</span></span></li>
      <li class="ind-1"><span class="line">\(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{X} \leftarrow \mathbf{M}_t / (\lVert \mathbf{M}_t \rVert_F + \epsilon)\)</span><span class="comment">Frobenius pre-norm</span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">for</span> \(i = 1, \dots, k_{\mathrm{p}}\) <span class="kw">do</span></span><span class="comment">stage 1: Promotion, \((a_{\mathrm{p}}, b_{\mathrm{p}}, c_{\mathrm{p}}) = (1.875, -1.25, 0.375)\)</span></li>
      <li class="ind-2 algo-diff"><span class="line">\(\mathbf{X} \leftarrow a_{\mathrm{p}}\mathbf{X} + b_{\mathrm{p}}\mathbf{X}\mathbf{X}^\top\mathbf{X} + c_{\mathrm{p}}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)</span></li>
      <li class="ind-1"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">for</span> \(j = 1, \dots, k_{\mathrm{s}}\) <span class="kw">do</span></span><span class="comment">stage 2: Suppression, \((a_{\mathrm{s}}, b_{\mathrm{s}}, c_{\mathrm{s}}) = (0, 2.5, -1.5)\)</span></li>
      <li class="ind-2 algo-diff"><span class="line">\(\mathbf{X} \leftarrow a_{\mathrm{s}}\mathbf{X} + b_{\mathrm{s}}\mathbf{X}\mathbf{X}^\top\mathbf{X} + c_{\mathrm{s}}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)</span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1"><span class="line">\(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)</span></li>
      <li><span class="line"><span class="kw">end for</span></span></li>
      <li><span class="line"><span class="kw">return</span> \(\boldsymbol{\Theta}_t\)</span></li>
    </ol>
  </div>

  <div class="algo-block compact">
    <div class="algo-header">
      <span class="algo-num">Algorithm 3</span>
      <span class="algo-title">Pion Optimizer (per head: high pass NS per attention head)</span>
    </div>
    <div class="algo-require">
      <span class="kw">Require:</span>&nbsp; learning rate \(\eta\), momentum coefficient \(\mu\), promotion steps \(k_{\mathrm{p}}\), heads \(H\)
    </div>
    <ol class="algo-steps">
      <li class="algo-diff"><span class="line">\(k_{\mathrm{s}} \leftarrow 5 - k_{\mathrm{p}}\)</span><span class="comment">split \(k = 5\) into \(k_{\mathrm{p}} + k_{\mathrm{s}}\)</span></li>
      <li><span class="line">\(\mathbf{M}_0 \leftarrow \mathbf{0}\)</span></li>
      <li><span class="line"><span class="kw">for</span> \(t = 1, 2, \dots\) <span class="kw">do</span></span></li>
      <li class="ind-1"><span class="line">\(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)</span></li>
      <li class="ind-1 algo-diff-2"><span class="line">\(\{\mathbf{M}_t^h\}_{h=1}^{H} \leftarrow \mathrm{Reshape}(\mathbf{M}_t)\)</span><span class="comment">split attention along head dim</span></li>
      <li class="ind-1 algo-diff-2"><span class="line">\(\mathbf{X}^h \leftarrow \mathbf{M}_t^h / (\lVert \mathbf{M}_t^h \rVert_F + \epsilon),\ \forall\, h\)</span><span class="comment">per head pre-norm</span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">for</span> \(i = 1, \dots, k_{\mathrm{p}}\) <span class="kw">do</span></span><span class="comment">stage 1: Promotion, batched over \(H\)</span></li>
      <li class="ind-2 algo-diff"><span class="line">\(\mathbf{X}^h \leftarrow a_{\mathrm{p}}\mathbf{X}^h + b_{\mathrm{p}}\mathbf{X}^h(\mathbf{X}^h)^\top\mathbf{X}^h + c_{\mathrm{p}}\mathbf{X}^h\bigl((\mathbf{X}^h)^\top\mathbf{X}^h\bigr)^2\)</span></li>
      <li class="ind-1"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">for</span> \(j = 1, \dots, k_{\mathrm{s}}\) <span class="kw">do</span></span><span class="comment">stage 2: Suppression, batched over \(H\)</span></li>
      <li class="ind-2 algo-diff"><span class="line">\(\mathbf{X}^h \leftarrow a_{\mathrm{s}}\mathbf{X}^h + b_{\mathrm{s}}\mathbf{X}^h(\mathbf{X}^h)^\top\mathbf{X}^h + c_{\mathrm{s}}\mathbf{X}^h\bigl((\mathbf{X}^h)^\top\mathbf{X}^h\bigr)^2\)</span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1 algo-diff-2"><span class="line">\(\mathbf{X} \leftarrow \mathrm{Reshape}^{-1}(\{\mathbf{X}^h\}_{h=1}^{H})\)</span><span class="comment">rejoin per head matrices</span></li>
      <li class="ind-1"><span class="line">\(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)</span></li>
      <li><span class="line"><span class="kw">end for</span></span></li>
      <li><span class="line"><span class="kw">return</span> \(\boldsymbol{\Theta}_t\)</span></li>
    </ol>
  </div>
</div>
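  <p>For readers who prefer code to pseudocode, the inner loops of Algorithms 1 and 2 can be sketched in NumPy as follows. This is a hedged illustration, not the released implementation: the function names, the value of <code>eps</code>, and the toy matrix are assumptions, and \(k_{\mathrm{p}} = 2\) is chosen only for the demo. The coefficient triples are the ones listed in the algorithms.</p>

```python
import numpy as np

# Coefficients (a, b, c) as listed in Algorithms 1 and 2.
MUON = (3.4445, -4.7750, 2.0315)
PROMOTION = (1.875, -1.25, 0.375)
SUPPRESSION = (0.0, 2.5, -1.5)


def ns_step(X, coeffs):
    """One quintic step: a*X + b*X X^T X + c*X (X^T X)^2."""
    a, b, c = coeffs
    gram = X.T @ X
    return a * X + b * (X @ gram) + c * (X @ gram @ gram)


def frobenius_prenorm(M, eps=1e-7):
    """Scale M so that every singular value lies in [0, 1]."""
    return M / (np.linalg.norm(M) + eps)   # default matrix norm is Frobenius


def muon_ns(M, k=5):
    """Inner loop of Algorithm 1: k steps with a single coefficient triple."""
    X = frobenius_prenorm(M)
    for _ in range(k):
        X = ns_step(X, MUON)
    return X


def pion_ns(M, k_p, k=5):
    """Inner loop of Algorithm 2: k_p Promotion steps, then k - k_p Suppression steps."""
    X = frobenius_prenorm(M)
    for _ in range(k_p):
        X = ns_step(X, PROMOTION)
    for _ in range(k - k_p):
        X = ns_step(X, SUPPRESSION)
    return X


if __name__ == "__main__":
    # Toy momentum: one dominant direction plus a faint tail (illustrative only).
    M = np.diag([1.0, 0.01])
    for name, X in (("Muon", muon_ns(M)), ("Pion", pion_ns(M, k_p=2))):
        print(name, np.linalg.svd(X, compute_uv=False))
```

  <p>Both functions perform the same number of matrix products per step; the only difference is which coefficient triple is passed to <code>ns_step</code>. On the toy input, the Muon loop lifts the faint direction to the same order of magnitude as the dominant one, whereas the Pion loop keeps the dominant direction near 1 and drives the faint one toward 0.</p>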

  <div class="callout">
    <p><strong>Takeaway.</strong> Pion is a drop in replacement for Muon’s NS iteration. Same control flow, same per step cost; only the polynomial coefficients change.</p>
  </div>

  <h2 id="experiments">Experiments</h2>

  <p>We evaluate Pion in two settings:</p>

  <ul>
    <li><strong>VLA training</strong>: two architectures, \(\ell_1\)-regression based VLA Adapter and flow-matching based VLANeXt, with LIBERO and LIBERO Plus as benchmarks.</li>
    <li><strong>RLVR post-training</strong>: GRPO and GMPO on Qwen3 1.7B and Qwen3 4B, with MATH and GSM8K as benchmarks.</li>
  </ul>

  <h3 id="vla">VLA</h3>

  <p><strong>VLA Adapter on LIBERO.</strong> We first compare AdamW, Muon, and Pion on VLA Adapter across the four LIBERO task suites (Object, Spatial, Goal, Long) under a fixed per-suite training budget (1,500 steps for Object, 15,000 steps for the others), together with a finer learning curve on Object.</p>

  <div id="fig-5" class="fig-grid" style="grid-template-columns: minmax(0, 2fr) minmax(0, 1fr); max-width: 90%; margin: 0.6em auto 0.05em; padding: 2px 10px 2px; row-gap: 0;">
  <div class="fig-legend" style="margin-top: 0; margin-bottom: -10px;"><img src="/images/blog/pion/vlaadapter_legend.png" alt="Legend: AdamW, Muon, Pion" /></div>
  <figure style="min-width: 0; height: auto; justify-content: flex-start; margin: 0;"><img src="/images/blog/pion/vlaadapter_libero.png" alt="LIBERO four tasks" style="width: 100%; height: auto; max-height: none; max-width: 100%; display: block; margin: 0;" /><figcaption style="white-space: normal; margin: 0;">(a) Success rates on LIBERO</figcaption></figure>
  <figure style="min-width: 0; height: auto; justify-content: flex-start; margin: 0;"><img src="/images/blog/pion/vlaadapter_training.png" alt="Object training curve" style="width: 100%; height: auto; max-height: none; max-width: 100%; display: block; margin: 0;" /><figcaption style="white-space: normal; margin: 0;">(b) Success rate vs. steps on Object</figcaption></figure>
</div>
  <p class="fig-caption">Figure 5. AdamW, Muon, and Pion for VLA Adapter on LIBERO. (a) Test success rates on LIBERO Object, Spatial, Goal, and Long at a fixed training budget per suite (1,500 steps for Object, 15,000 steps for the others). (b) Test success rate vs. training steps on LIBERO Object.</p>

  <p><a href="#fig-5">Figure 5</a>(a) shows that Pion outperforms both Muon and AdamW on every suite. <a href="#fig-5">Figure 5</a>(b) zooms into the LIBERO Object learning curve: Pion reaches 95.4% success at 500 steps and saturates at 100% by 1,500 steps, while AdamW requires substantially more steps to catch up. This indicates that the spectral high pass markedly reduces the training cost needed to reach the high-success regime.</p>

  <p><strong>VLANeXt on LIBERO and LIBERO Plus.</strong> With flow matching, Pion not only achieves the best success rate on LIBERO but also retains its advantage on the more challenging LIBERO Plus split, with the largest gains over Muon under the language (\(+9\) pts), noise (\(+6\) pts), and robot (\(+6\) pts) perturbations; see <a href="#tab-1">Table 1</a>. This is consistent with our earlier picture that uniform whitening over-amplifies noise directions that do not generalize.</p>

  <div id="tab-1" class="results-table">

    <table>
      <thead>
        <tr>
          <th style="text-align: left">Optimizer</th>
          <th style="text-align: center">LIBERO</th>
          <th style="text-align: center">LIBERO Plus</th>
          <th style="text-align: center">Background</th>
          <th style="text-align: center">Camera</th>
          <th style="text-align: center">Language</th>
          <th style="text-align: center">Layout</th>
          <th style="text-align: center">Light</th>
          <th style="text-align: center">Noise</th>
          <th style="text-align: center">Robot</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td style="text-align: left">AdamW</td>
          <td style="text-align: center">79.45</td>
          <td style="text-align: center">64.57</td>
          <td style="text-align: center">68.97</td>
          <td style="text-align: center">70.38</td>
          <td style="text-align: center">54.50</td>
          <td style="text-align: center">61.80</td>
          <td style="text-align: center">76.35</td>
          <td style="text-align: center">66.37</td>
          <td style="text-align: center">47.04</td>
        </tr>
        <tr>
          <td style="text-align: left">Muon</td>
          <td style="text-align: center">93.65</td>
          <td style="text-align: center">72.34</td>
          <td style="text-align: center">82.72</td>
          <td style="text-align: center">68.00</td>
          <td style="text-align: center">77.53</td>
          <td style="text-align: center">76.21</td>
          <td style="text-align: center">86.17</td>
          <td style="text-align: center">69.98</td>
          <td style="text-align: center">57.36</td>
        </tr>
        <tr>
          <td style="text-align: left"><strong>Pion (Ours)</strong></td>
          <td style="text-align: center"><strong>96.35</strong></td>
          <td style="text-align: center"><strong>75.93</strong></td>
          <td style="text-align: center"><strong>84.53</strong></td>
          <td style="text-align: center"><strong>70.88</strong></td>
          <td style="text-align: center"><strong>86.93</strong></td>
          <td style="text-align: center"><strong>76.71</strong></td>
          <td style="text-align: center"><strong>90.67</strong></td>
          <td style="text-align: center"><strong>76.09</strong></td>
          <td style="text-align: center"><strong>63.18</strong></td>
        </tr>
      </tbody>
    </table>

  </div>
  <p class="fig-caption">Table 1. AdamW, Muon, and Pion for VLANeXt on LIBERO and LIBERO Plus. Best in <strong>bold</strong>.</p>
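  <p>The per-column gaps quoted in the text can be recomputed directly from Table 1. The short script below only copies the Muon and Pion rows of the table and subtracts them; the variable names are ours.</p>

```python
COLUMNS = ["LIBERO", "LIBERO Plus", "Background", "Camera", "Language",
           "Layout", "Light", "Noise", "Robot"]

# Success rates copied from the Muon and Pion rows of Table 1.
MUON = [93.65, 72.34, 82.72, 68.00, 77.53, 76.21, 86.17, 69.98, 57.36]
PION = [96.35, 75.93, 84.53, 70.88, 86.93, 76.71, 90.67, 76.09, 63.18]

# Gain of Pion over Muon in percentage points, per column.
gains = {name: round(p - m, 2) for name, m, p in zip(COLUMNS, MUON, PION)}

# Print the columns from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} +{gain:.2f} pts")
```

  <p>Sorting by gain puts Language, Noise, and Robot at the top, which are the three perturbations highlighted above; every column shows a positive gap over Muon.</p>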

  <p>To make the LIBERO Plus gap concrete, we roll out the same LIBERO Plus episode under VLANeXt policies trained with each optimizer. AdamW and Muon fail at the grasp or placement stage, while Pion completes the task cleanly.</p>

  <div id="video-1" class="fig-grid" style="grid-template-columns: repeat(3, 1fr);">
  <figure><video src="/videos/blog/pion/ep1373_AdamW.mp4" autoplay="" loop="" muted="" playsinline=""></video><figcaption>(a) AdamW</figcaption></figure>
  <figure><video src="/videos/blog/pion/ep1373_Muon.mp4" autoplay="" loop="" muted="" playsinline=""></video><figcaption>(b) Muon</figcaption></figure>
  <figure><video src="/videos/blog/pion/ep1373_Pion.mp4" autoplay="" loop="" muted="" playsinline=""></video><figcaption>(c) Pion (Ours)</figcaption></figure>
</div>
  <p class="fig-caption">Video 1. Rollouts on the same LIBERO Plus episode (ep1373) under VLANeXt policies trained with the three optimizers. Only Pion reliably completes the task; AdamW and Muon fail in the grasp or placement stage.</p>

  <h3 id="rlvr">RLVR</h3>

  <p>Across all eight RLVR settings (GRPO/GMPO × Qwen3 1.7B/4B × MATH/GSM8K; see <a href="#fig-6">Figure 6</a>), Muon collapses to near-zero accuracy without exception. Pion not only recovers a meaningful training signal but also converges faster than AdamW.</p>

  <div id="fig-6" class="fig-grid" style="grid-template-columns: repeat(4, 1fr);">
  <figure><img src="/images/blog/pion/rl_val_core_grpo_math3-5_qwen3_1.7b.png" alt="" /><figcaption>(a) GRPO, 1.7B, MATH</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_grpo_math3-5_qwen3_4b.png" alt="" /><figcaption>(b) GRPO, 4B, MATH</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_grpo_gsm8k_qwen3_1.7b.png" alt="" /><figcaption>(c) GRPO, 1.7B, GSM8K</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_grpo_gsm8k_qwen3_4b.png" alt="" /><figcaption>(d) GRPO, 4B, GSM8K</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_gmpo_math3-5_qwen3_1.7b.png" alt="" /><figcaption>(e) GMPO, 1.7B, MATH</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_gmpo_math3-5_qwen3_4b.png" alt="" /><figcaption>(f) GMPO, 4B, MATH</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_gmpo_gsm8k_qwen3_1.7b.png" alt="" /><figcaption>(g) GMPO, 1.7B, GSM8K</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_gmpo_gsm8k_qwen3_4b.png" alt="" /><figcaption>(h) GMPO, 4B, GSM8K</figcaption></figure>
</div>
  <p class="fig-caption">Figure 6. AdamW, Muon, and Pion on RLVR: validation accuracy vs. training step across eight settings (two algorithms × two model sizes × two benchmarks).</p>

  <p><strong>Reverse ablation: direction of spectral shaping matters.</strong> To verify that the gains come specifically from the high-pass direction, we construct <strong>Low-pass Muon (LPMuon)</strong>, which shares Pion’s NS structure and per-step cost but reverses the filtering direction to low-pass (contracting large singular values and amplifying small ones). LPMuon fails to train: its accuracy stays at the level of the initial checkpoint; see <a href="#fig-7">Figure 7</a>.</p>

  <div id="fig-7" class="fig-grid wrap" style="grid-template-columns: repeat(2, 1fr);">
  <figure><img src="/images/blog/pion/pion0_muon_lowrankmuon.png" alt="Low pass profile" /><figcaption>(a) Low pass scalar map</figcaption></figure>
  <figure><img src="/images/blog/pion/mhlpmuon.png" alt="GSM8K accuracy" /><figcaption>(b) GSM8K accuracy</figcaption></figure>
</div>
  <p class="fig-caption">Figure 7. (a) Scalar map \(f(\sigma)\) of LPMuon. (b) GSM8K accuracy of AdamW, Pion, and LPMuon (Qwen3 1.7B, GRPO).</p>

  <h2 id="bibtex">BibTeX</h2>

  <pre class="bibtex-block"><code>@misc{fan2026rethinkingmuonpretrainingspectral,
      title={Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR}, 
      author={Chongyu Fan and Gaowen Liu and Mingyi Hong and Ramana Rao Kompella and Sijia Liu},
      year={2026},
      eprint={2605.19282},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.19282}, 
}
</code></pre>

</div>

<div class="post-lang post-lang-zh" hidden="">

  <div class="tldr">

    <h2 id="tldr-1">TL;DR</h2>

    <ul>
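      <li><strong>Muon</strong> is a “matrix aware” optimizer widely adopted for <strong>LLM pretraining</strong>. It orthogonalizes the momentum matrix through Newton–Schulz (NS) iterations, which push every singular value toward 1.</li>
      <li>However, we find that this one-size-fits-all amplification of every singular value has its limits: once we leave the LLM pretraining setting (a different <strong>modality</strong>, <strong>loss</strong>, or <strong>learning paradigm</strong>), it no longer fits.</li>
      <li>We therefore propose <strong>Pion</strong> (s<strong>P</strong>ectral h<strong>I</strong>gh-pass <strong>O</strong>ptimization on mome<strong>N</strong>tum). It only changes the coefficients of Muon’s NS iteration and keeps the per-step cost identical to Muon’s, yet realizes a “spectral high pass”: the leading singular values that carry most of the information are anchored near 1, while the noise-dominated tail is suppressed toward 0.</li>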
    </ul>

  </div>

  <h2 id="section">Try it: where does σ go?</h2>

  <div class="pion-demo" data-muon-labels="whitened to 1|lifted from 0|overshoots 1 (oscillates)" data-pion-labels="kept (high pass)|filtered (suppressed)|kept (high pass)">
  <div class="pion-demo-body">
    <div class="pion-demo-controls">
      <div class="pion-demo-intro">
        Pick a <span class="kbd">&sigma; &isin; [0, 1]</span> and watch where the same <span class="kbd">&sigma;</span> lands under Muon’s NS and under Pion’s high pass NS. Muon tries to push every <span class="kbd">&sigma;</span> to 1; Pion amplifies the head and shrinks the tail.
      </div>
      <div class="pion-demo-slider-row">
        <span class="sigma-label">&sigma;</span>
        <input type="range" min="0" max="1" step="0.001" value="0.35" class="pion-demo-slider" aria-label="Singular value sigma" />
        <span class="pion-demo-sigma">0.350</span>
      </div>
      <div class="pion-demo-outputs">
        <div class="pion-demo-output muon">
          <div class="out-label">Muon (NS)</div>
          <div class="out-result-wrap">
            <div class="out-divider"></div>
            <div class="out-result">
              <span class="out-input">&sigma; = <span class="out-sigma-in">0.350</span></span>
              <span class="out-arrow-sym">&#x21A6;</span>
              <span class="out-value">&mdash;</span>
            </div>
            <div class="out-tag">&nbsp;</div>
          </div>
        </div>
        <div class="pion-demo-output pion">
          <div class="out-label">Pion (High-pass NS)</div>
          <div class="out-result-wrap">
            <div class="out-divider"></div>
            <div class="out-result">
              <span class="out-input">&sigma; = <span class="out-sigma-in">0.350</span></span>
              <span class="out-arrow-sym">&#x21A6;</span>
              <span class="out-value">&mdash;</span>
            </div>
            <div class="out-tag">&nbsp;</div>
          </div>
        </div>
      </div>
    </div>
    <div class="pion-demo-plot-wrap">
      <svg class="pion-demo-plot" xmlns="http://www.w3.org/2000/svg" aria-hidden="true"></svg>
      <div class="pion-demo-legend">
        <span class="lg lg-muon"><span class="swatch"></span>Muon</span>
        <span class="lg lg-pion"><span class="swatch"></span>Pion</span>
      </div>
    </div>
  </div>
</div>

  <h2 id="section-1">Background</h2>

  <h3 id="muon-1">Muon</h3>

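  <p>Muon is a “<strong>matrix aware</strong>” optimizer that is widely used in <strong>LLM pretraining</strong>. Let the weight matrix be \(\boldsymbol{\Theta} \in \mathbb{R}^{m \times n}\), the gradient \(\mathbf{G}_t\), and the momentum \(\mathbf{M}_t = \mu \mathbf{M}_{t-1} + \mathbf{G}_t\). Muon performs steepest descent under the spectral norm:</p>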

\[\boldsymbol{\Theta}_t = \boldsymbol{\Theta}_{t-1} - \eta \, \mathrm{msign}(\mathbf{M}_t).\]

  <p>If \(\mathbf{M} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top\) is the compact SVD of \(\mathbf{M}\), then</p>

\[\mathrm{msign}(\mathbf{M}) = \mathbf{U}\, \mathrm{sign}(\boldsymbol{\Sigma})\, \mathbf{V}^\top = \mathbf{U} \mathbf{V}^\top.\]

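  <p>Every nonzero singular value is mapped to 1, which is what we call <strong>uniform spectral whitening</strong>. Running an SVD at every step is too expensive for large-model training, so Muon instead approximates \(\mathrm{msign}\) with a few NS iterations. The input is first normalized as \(\mathbf{X} \leftarrow \mathbf{X} / (\|\mathbf{X}\|_F + \epsilon)\), and each NS step then applies an odd polynomial:</p>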

\[\mathbf{X} \leftarrow a\, \mathbf{X} + b\, \mathbf{X}\mathbf{X}^\top \mathbf{X} + c\, \mathbf{X}(\mathbf{X}^\top \mathbf{X})^2,\]

  <p>with coefficients \((a, b, c) = (3.4445,\ -4.7750,\ 2.0315)\). By the identity \(\mathbf{X}(\mathbf{X}^\top \mathbf{X})^j = \mathbf{U}\, \boldsymbol{\Sigma}^{2j+1}\, \mathbf{V}^\top\), each NS step leaves the singular vectors unchanged and reshapes only the singular values, through a <strong>scalar polynomial</strong> on \([0, 1]\):</p>

\[f(\sigma;\, a, b, c) \,\triangleq\, a\sigma + b\sigma^3 + c\sigma^5.\]

  <p>Iterated, Muon’s NS pushes any \(\sigma \in (0, 1]\) into a neighborhood of 1. The graph of this map is shown in <a href="#fig-1-zh">Figure 1</a>.</p>
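  <p>Since each NS step acts on a singular value only through the scalar map above, Muon’s whitening can be explored without any matrices. The sketch below iterates \(f\) five times with Muon’s coefficients; the function names and the sample values of \(\sigma\) are ours, chosen for illustration.</p>

```python
def quintic(sigma, a, b, c):
    """Scalar map f(sigma; a, b, c) = a*sigma + b*sigma^3 + c*sigma^5."""
    return a * sigma + b * sigma ** 3 + c * sigma ** 5


def muon_scalar_map(sigma, k=5):
    """Apply Muon's NS polynomial k times to one singular value."""
    for _ in range(k):
        sigma = quintic(sigma, 3.4445, -4.7750, 2.0315)
    return sigma


if __name__ == "__main__":
    for sigma in (0.01, 0.1, 0.35, 1.0):
        print(f"{sigma:.2f} -> {muon_scalar_map(sigma):.3f}")
```

  <p>With these coefficients the iterates do not settle exactly at 1; they land in a band around it and may overshoot along the way. The point that matters for what follows is that a tiny input such as \(\sigma = 0.01\) ends up at the same order of magnitude as \(\sigma = 1\).</p>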

  <h3 id="muon-">Does Muon still work in other settings?</h3>

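  <p>LLM pretraining typically uses the next token prediction loss; more specifically, the task is <strong>classification</strong>, the only modality is <strong>text</strong>, and the paradigm is <strong>supervised learning</strong>. When token-level supervision is dense and accurate, pushing every singular value to 1 is a reasonable default. But LLM pretraining is only one part of deep learning, and how Muon behaves under a different <strong>modality</strong>, a different <strong>loss</strong>, or a different <strong>learning paradigm</strong> remains an open question:</p>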

  <div class="axes-anim">
  <div class="axes-anim-head">
    <div class="axes-anim-title">
      Does Muon still work in other settings?
    </div>
    <div class="axes-anim-sub">
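      LLM pretraining:
      <span class="corner">text</span>
      &nbsp;&plus;&nbsp;
      <span class="corner">classification</span>
      &nbsp;&plus;&nbsp;
      <span class="corner">supervised learning</span>.
      We test whether Muon remains effective by changing the <strong>modality</strong>, the <strong>loss</strong>, and the <strong>learning paradigm</strong>.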
    </div>
  </div>

  <div class="axis-track">
    <div class="axis-tag">
      <span class="axis-num">1</span>
      <div>
        <div class="axis-name">Modality</div>
        <div class="axis-hint">Language &nbsp;→&nbsp; vision / action</div>
      </div>
    </div>
    <div class="axis-from">
      <div class="ax-tok-row">
        <span class="ax-tok">the</span><span class="ax-tok">cat</span><span class="ax-tok">sits</span><span class="ax-tok">on</span><span class="ax-tok">mat</span>
      </div>
      <div class="ax-cell-label">Text tokens</div>
    </div>
    <div class="axis-arrow"><div class="track"><span class="pulse p1"></span><span class="pulse p2"></span><span class="pulse p3"></span></div><div class="head"></div></div>
    <div class="axis-to">
      <div class="ax-mod-to">
        <div class="ax-mod-pix" aria-label="image patches">
          <span style="background:#a5b4fc"></span><span style="background:#c7d2fe"></span><span style="background:#818cf8"></span>
          <span style="background:#c7d2fe"></span><span style="background:#a5b4fc"></span><span style="background:#818cf8"></span>
          <span style="background:#818cf8"></span><span style="background:#a5b4fc"></span><span style="background:#c7d2fe"></span>
        </div>
        <svg class="ax-mod-arm" viewBox="0 0 48 48" aria-hidden="true" fill="none">
          <ellipse cx="24" cy="45" rx="12" ry="1.2" fill="rgba(15,23,42,0.10)" />
          <rect x="12" y="40.5" width="24" height="4" rx="1.5" fill="#4485C7" />
          <rect x="22" y="31.5" width="4" height="10" fill="#4485C7" />
          <line x1="24" y1="31.5" x2="11" y2="21" stroke="#4485C7" stroke-width="4" stroke-linecap="round" />
          <circle cx="24" cy="31.5" r="3.2" fill="#4485C7" />
          <circle cx="24" cy="31.5" r="1.3" fill="#fff" />
          <line x1="11" y1="21" x2="33" y2="10" stroke="#4485C7" stroke-width="4" stroke-linecap="round" />
          <circle cx="11" cy="21" r="2.8" fill="#4485C7" />
          <circle cx="11" cy="21" r="1.1" fill="#fff" />
          <circle cx="33" cy="10" r="2.3" fill="#4485C7" />
          <path d="M33 10 L39 6 L41 8" stroke="#4485C7" stroke-width="2.2" stroke-linecap="round" stroke-linejoin="round" />
          <path d="M33 10 L39 14 L41 12" stroke="#4485C7" stroke-width="2.2" stroke-linecap="round" stroke-linejoin="round" />
        </svg>
      </div>
      <div class="ax-cell-label">Image patches / robot actions</div>
    </div>
  </div>

  <div class="axis-track">
    <div class="axis-tag">
      <span class="axis-num">2</span>
      <div>
        <div class="axis-name">Loss</div>
        <div class="axis-hint">Classification &nbsp;→&nbsp; regression</div>
      </div>
    </div>
    <div class="axis-from">
      <svg class="ax-onehot" viewBox="0 0 130 44" aria-hidden="true">
        <line x1="0" y1="20" x2="130" y2="20" stroke="#e5e9ef" stroke-width="0.8" />
        <rect x="6" y="16" width="14" height="4" fill="#93BBE0" rx="1" />
        <rect x="24" y="13" width="14" height="7" fill="#93BBE0" rx="1" />
        <rect x="42" y="10" width="14" height="10" fill="#93BBE0" rx="1" />
        <rect x="60" y="4" width="14" height="16" fill="#4485C7" rx="1" />
        <rect x="78" y="11" width="14" height="9" fill="#93BBE0" rx="1" />
        <rect x="96" y="14" width="14" height="6" fill="#93BBE0" rx="1" />
        <rect x="114" y="16" width="10" height="4" fill="#93BBE0" rx="1" />
        <line x1="0" y1="40" x2="130" y2="40" stroke="#e5e9ef" stroke-width="1" />
        <rect x="6" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="24" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="42" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="60" y="23" width="14" height="17" fill="#4485C7" rx="1" />
        <rect x="78" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="96" y="37.5" width="14" height="2.5" fill="#cbd5e1" rx="1" />
        <rect x="114" y="38" width="10" height="2" fill="#cbd5e1" rx="1" />
      </svg>
      <div class="ax-cell-label">Next token prediction</div>
    </div>
    <div class="axis-arrow"><div class="track"><span class="pulse p1"></span><span class="pulse p2"></span><span class="pulse p3"></span></div><div class="head"></div></div>
    <div class="axis-to">
      <svg class="ax-curve" viewBox="0 0 130 44" aria-hidden="true">
        <line x1="0" y1="40" x2="130" y2="40" stroke="rgba(68,133,199,0.4)" stroke-width="1" />
        <line x1="8" y1="34" x2="124" y2="9" stroke="#4485C7" stroke-width="1.8" stroke-linecap="round" />
        <circle cx="14" cy="33" r="1.8" fill="#4485C7" />
        <circle cx="28" cy="29" r="1.8" fill="#4485C7" />
        <circle cx="42" cy="28" r="1.8" fill="#4485C7" />
        <circle cx="56" cy="22" r="1.8" fill="#4485C7" />
        <circle cx="70" cy="22" r="1.8" fill="#4485C7" />
        <circle cx="84" cy="16" r="1.8" fill="#4485C7" />
        <circle cx="100" cy="14" r="1.8" fill="#4485C7" />
        <circle cx="116" cy="11" r="1.8" fill="#4485C7" />
      </svg>
      <div class="ax-cell-label">Continuous output</div>
    </div>
  </div>

  <div class="axis-track">
    <div class="axis-tag">
      <span class="axis-num">3</span>
      <div>
        <div class="axis-name">Learning paradigm</div>
        <div class="axis-hint">Supervised learning &nbsp;→&nbsp; reinforcement learning</div>
      </div>
    </div>
    <div class="axis-from">
      <div class="ax-tape">
        <div class="row">
          <span class="ax-tok">x<sub>1</sub></span><span class="ax-tok">x<sub>2</sub></span><span class="ax-tok">x<sub>3</sub></span><span class="ax-tok">x<sub>4</sub></span><span class="ax-tok">x<sub>5</sub></span>
        </div>
        <div class="row">
          <span class="arr dense">↓</span><span class="arr dense">↓</span><span class="arr dense">↓</span><span class="arr dense">↓</span><span class="arr dense">↓</span>
        </div>
      </div>
      <div class="ax-cell-label">Teacher signal on every token</div>
    </div>
    <div class="axis-arrow"><div class="track"><span class="pulse p1"></span><span class="pulse p2"></span><span class="pulse p3"></span></div><div class="head"></div></div>
    <div class="axis-to">
      <div class="ax-tape">
        <div class="row">
          <span class="ax-tok">x<sub>1</sub></span><span class="ax-tok">x<sub>2</sub></span><span class="ax-tok">x<sub>3</sub></span><span class="ax-tok">x<sub>4</sub></span><span class="ax-tok">x<sub>5</sub></span>
        </div>
        <div class="row">
          <span class="arr dot">·</span><span class="arr dot">·</span><span class="arr dot">·</span><span class="arr dot">·</span><span class="arr reward">↓</span>
        </div>
      </div>
      <div class="ax-cell-label">Trajectory-level reward</div>
    </div>
  </div>
</div>

  <h2 id="section-2">Motivation</h2>

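  <p>We consider two settings: VLA and RLVR. VLA changes the modality and the loss at the same time (it introduces the vision and action modalities and replaces the classification loss with a regression loss); RLVR changes only the learning paradigm (supervised learning is replaced by reinforcement learning). By analyzing gradient rank and signal-to-noise ratio, we find that Muon is a poor fit for both the action modality and the reinforcement learning setting.</p>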

  <h3 id="vla-1">Changing modality and loss: VLA</h3>

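  <p>A VLA model has three parts: a <strong>vision encoder</strong>, a <strong>language backbone</strong>, and an <strong>action head</strong>. The vision and language modules take text instructions and images as input; the action head corresponds to an entirely new modality, and its output is the robot’s action. Accordingly, the action head usually uses a non-classification loss: either \(\ell_1\) regression or flow matching.</p>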

  <p>We use the <strong>effective rank</strong> (erank) to characterize the spectral structure of each module’s gradient \(\mathbf{G} \in \mathbb{R}^{m \times n}\):</p>

\[\mathrm{erank}(\mathbf{G}) \,\triangleq\, \exp\!\Big( H(\mathbf{p}) \Big), \quad H(\mathbf{p}) = -\sum_{i=1}^n p_i \log p_i, \quad p_i = \frac{\sigma_i(\mathbf{G})}{\sum_j \sigma_j(\mathbf{G})}.\]

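  <p>A larger erank means the gradient’s energy is spread over many singular directions; a smaller erank means it is concentrated in a few leading directions.</p>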
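  <p>As a concrete reading of the definition above, the following sketch computes erank from a list of singular values. Computing the singular values themselves (for example with an SVD routine) is left out, and the function name is ours.</p>

```python
import math


def erank(singular_values):
    """Effective rank: exp of the Shannon entropy of the normalized spectrum."""
    total = sum(singular_values)
    # Zero singular values contribute nothing to the entropy (0 * log 0 := 0).
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


if __name__ == "__main__":
    print(erank([1.0, 1.0, 1.0, 1.0]))      # flat spectrum
    print(erank([1.0, 0.01, 0.01, 0.01]))   # one dominant direction
```

  <p>A flat spectrum over four directions gives an erank of 4, while a spectrum with a single nonzero singular value gives an erank of 1; spectra with one dominant direction and a faint tail fall just above 1.</p>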

  <div id="fig-2-zh" class="fig-grid" style="grid-template-columns: repeat(3, 1fr);">
  <figure><img src="/images/blog/pion/erank_heatmap_sampled.png" alt="erank" /><figcaption>(a) Effective rank (erank) of each module’s gradient</figcaption></figure>
  <figure><img src="/images/blog/pion/success_rate.png" alt="success rate" /><figcaption>(b) Test success rate</figcaption></figure>
  <figure><img src="/images/blog/pion/total_training_time.png" alt="training time" /><figcaption>(c) Total training time (hours)</figcaption></figure>
</div>
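  <p class="fig-caption">Figure 2. Limitations of Muon in VLA training (VLA Adapter + LIBERO Object). (a) Effective rank (erank) of each modality’s gradient when all modalities are trained with AdamW. (b)(c) Test success rate at 4.5k steps and total training time (vision and language use AdamW; only the optimizer of the action module is switched).</p>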

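  <p><a href="#fig-2-zh">Figure 2</a>(a) shows that the erank of each modality’s gradient is very stable throughout training: vision is the highest, language is in the middle, and <strong>the action gradient’s erank is consistently the lowest</strong>. The main reason is that vision and text inputs usually carry richer information, whereas an action can be expressed with only 7 degrees of freedom; in addition, the action modality is trained with a regression loss, whose output space is far smaller than the discrete token space of language and vision, so its gradient is strongly low-rank. Applying Muon’s uniform whitening directly to this low-rank action gradient amounts to lifting the faint noise tail to the same magnitude as the few leading directions, so the final parameter update is almost entirely dominated by noise from the spectral floor. As <a href="#fig-2-zh">Figure 2</a>(b) shows, Muon on the action module achieves an even lower success rate than AdamW. An intuitive fix is <strong>Low-Rank Muon (LRMuon)</strong>: before the NS iteration, project the momentum matrix onto its top-\(k\) subspace via SVD or Gaussian sketching. This recovers the success rate, but <a href="#fig-2-zh">Figure 2</a>(c) shows that the explicit low-rank projection adds a heavy computational burden, inflating the total training cost by roughly an order of magnitude.</p>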

  <div class="callout">
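    <p><strong>Limitation 1 (modality + loss).</strong> Standard Muon cannot adapt to the spectral rank heterogeneity introduced by new modalities and new losses; explicit low-rank projection recovers the success rate, but at the cost of scalability.</p>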
  </div>

  <h3 id="rlvr-1">Changing the learning paradigm: RLVR</h3>

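  <p>RLVR keeps the LLM and the text modality unchanged; the only thing that changes is the <strong>learning paradigm</strong>: the token-level supervised loss (e.g. SFT) is replaced by a trajectory-level policy gradient on verifiable rewards (e.g. GRPO). To compare the two paradigms on a common footing, we examine the signal-to-noise ratio of the per-step gradient:</p>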

\[\mathrm{SNR}(\mathbf{G}) \,\triangleq\, \frac{\|\mathbb{E}[\mathbf{G}]\|_F^2}{\mathbb{E}\big[\,\|\mathbf{G} - \mathbb{E}[\mathbf{G}]\|_F^2\,\big]}.\]
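  <p>In practice the expectations in this definition have to be replaced by averages over sampled gradients. The sketch below is a plain plug-in estimate under that assumption; how the samples are drawn (for example, one gradient per mini-batch at a fixed checkpoint) is not specified here, and the function name is ours.</p>

```python
def gradient_snr(samples):
    """Plug-in SNR estimate from a list of flattened gradient samples.

    Each sample is a list of floats of the same length. The expectation is
    replaced by the sample mean, so the estimate is biased for small samples.
    """
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(g[i] for g in samples) / n for i in range(dim)]
    # Signal: squared Frobenius norm of the mean gradient.
    signal = sum(m * m for m in mean)
    # Noise: average squared Frobenius distance of a sample from the mean.
    noise = sum(sum((g[i] - mean[i]) ** 2 for i in range(dim)) for g in samples) / n
    return signal / noise


if __name__ == "__main__":
    print(gradient_snr([[1.0, 0.0], [3.0, 0.0]]))
```

  <p>The function divides by the noise term, so it is undefined when all samples coincide; a real measurement would also need enough samples for the mean to be meaningful.</p>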

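  <p>The SNR gap in <a href="#fig-3-zh">Figure 3</a>(a) has two structural sources. The first is <strong>coarser supervision granularity</strong>: SFT uses token-level teacher signals, whereas GRPO uses trajectory-level rewards, so the learning signal each token actually receives is much sparser. The second is the <strong>stabilization mechanisms</strong>: importance sampling, clipping, and advantage normalization all reweight or erase part of the per-token gradient, which further amplifies the variance. As <a href="#fig-3-zh">Figure 3</a>(b) shows, Muon whitens this low-SNR gradient directly, which lifts the noise directions to the same magnitude as the informative ones, and the policy collapses within a few steps.</p>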

  <div id="fig-3-zh" class="fig-grid wrap" style="grid-template-columns: repeat(2, 1fr);">
  <figure><img src="/images/blog/pion/grad_snr_sft_vs_rl_math3-5_step80.png" alt="SFT vs GRPO SNR" /><figcaption>(a) Gradient SNR: SFT vs. GRPO</figcaption></figure>
  <figure><img src="/images/blog/pion/acc_adamw_muon.png" alt="Accuracy AdamW vs Muon" /><figcaption>(b) MATH500: AdamW vs. Muon</figcaption></figure>
</div>
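  <p class="fig-caption">Figure 3. RLVR SNR (Qwen3 1.7B, MATH). (a) The gradient SNR of GRPO stays clearly below that of SFT throughout training. (b) Under GRPO, AdamW improves steadily, whereas Muon collapses to near-zero accuracy within a few steps.</p>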

  <div class="callout">
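    <p><strong>Limitation 2 (learning paradigm).</strong> Muon's uniform whitening amplifies the noise directions in low-SNR RLVR gradients, which makes it unsuitable for noise-sensitive post-training.</p>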
  </div>

  <h2 id="section-3">Method</h2>

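  <p>Limitations 1 and 2 have different origins: the former stems from the low erank of a different modality / loss, the latter from the low SNR of a different learning paradigm. In the spectrum, however, they look the same. In the SVD of \(\mathbf{M}_t\), the <strong>leading few</strong> singular values carry the informative descent directions, while the <strong>many remaining</strong> small singular values are dominated by noise (a spectral floor when erank is low, stochastic estimation noise when SNR is low). Muon's \(\mathrm{msign}\) lifts the tail to the same magnitude as the head, so it corrupts the update in both cases. The natural remedy is a <strong>spectral high pass</strong>: anchor the head near 1 and suppress the tail toward 0.</p>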

  <h3 id="ns">High-pass NS</h3>

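  <p>Each NS step reshapes \(\sigma\) through \(f(\sigma; a, b, c) = a\sigma + b\sigma^3 + c\sigma^5\), so designing NS reduces to designing \(f\). A single quintic polynomial can hardly realize the high pass we need; Pion therefore splits the default \(k = 5\) NS steps into two stages with different coefficients:</p>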

  <ul>
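    <li>The first stage, <strong>Promotion</strong>, runs the polynomial \(f_{\mathrm{p}}\) for \(k_{\mathrm{p}}\) steps: it lifts the head singular values to near 1 while preserving their relative magnitudes;</li>
    <li>The second stage, <strong>Suppression</strong>, runs the polynomial \(f_{\mathrm{s}}\) for \(k_{\mathrm{s}} = k - k_{\mathrm{p}}\) steps: it anchors the large singular values that are already close to 1 at 1 and pushes the still-small ones further toward 0.</li>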
  </ul>

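  <p>We impose three requirements on \(f_{\mathrm{p}}\): <strong>(P1)</strong> \(\sigma = 1\) is a fixed point, \(f_{\mathrm{p}}(1) = 1\); <strong>(P2)</strong> first-order stationarity, \(f_{\mathrm{p}}'(1) = 0\); <strong>(P3)</strong> boundary concavity, \(f_{\mathrm{p}}''(1) \leq 0\), which together with (P2) ensures that \(\sigma = 1\) is a local maximum, so the iteration does not overshoot 1 near the boundary and diverge. (P1) and (P2) restrict the feasible solutions to a one-parameter family; adding (P3) and monotonicity on \([0, 1]\) shrinks the feasible parameter to \(a_{\mathrm{p}} \in [0, 1.875]\). Since \(f_{\mathrm{p}}'(0) = a_{\mathrm{p}}\) controls how strongly each step lifts small singular values, we simply take the largest feasible slope, which uniquely determines the polynomial:</p>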

\[f_{\mathrm{p}}(\sigma) = 1.875\, \sigma \,-\, 1.25\, \sigma^3 \,+\, 0.375\, \sigma^5.\]

  <p>Its derivative is exactly a perfect square, \(f_{\mathrm{p}}'(\sigma) = 1.875\, (1 - \sigma^2)^2 \geq 0\), so monotonicity holds automatically.</p>
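
  <p>For readers who want the intermediate algebra, the one-parameter family can be written out explicitly. This is only a worked restatement of (P1) to (P3) for \(f(\sigma) = a\sigma + b\sigma^3 + c\sigma^5\), not an additional result. Solving the two linear conditions (P1) and (P2) for \(b\) and \(c\) in terms of \(a\) gives</p>

\[a + b + c = 1, \qquad a + 3b + 5c = 0 \quad\Longrightarrow\quad b = \tfrac{5}{2} - 2a, \qquad c = a - \tfrac{3}{2}.\]

  <p>Substituting into (P3) bounds the slope at the origin:</p>

\[f''(1) = 6b + 20c = 8a - 15 \,\leq\, 0 \quad\Longrightarrow\quad a \,\leq\, \tfrac{15}{8} = 1.875.\]

  <p>At the upper end \(a = 1.875\) the family yields \((a, b, c) = (1.875, -1.25, 0.375)\), matching \(f_{\mathrm{p}}\). As a cross-check, the lower end \(a = 0\) yields \((0, 2.5, -1.5)\), which is the Suppression polynomial \(f_{\mathrm{s}}\): the two stages sit at opposite ends of the same family.</p>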

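  <p>Suppression likewise requires \(f_{\mathrm{s}}(1) = 1\) and \(f_{\mathrm{s}}'(1) = 0\), plus one <strong>spectral filtering</strong> condition: \(f_{\mathrm{s}}'(0) = 0\). With the linear term near the origin removed, small singular values can only be shrunk step by step toward 0 by the higher-order terms. The unique solution is:</p>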

\[f_{\mathrm{s}}(\sigma) = 2.5\, \sigma^3 \,-\, 1.5\, \sigma^5.\]

  <p>Chaining \(k_{\mathrm{p}}\) Promotion steps with \(k_{\mathrm{s}}\) Suppression steps yields Pion's <strong>high-pass NS</strong>. Fixing \(k = 5\) keeps the per-step cost identical to Muon's. <a href="#fig-1-zh">Figure 1</a> compares the function plots of Muon NS, Promotion, Suppression, and Pion's high-pass NS.</p>
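
  <p>The composed scalar map is easy to trace numerically. The sketch below applies \(k_{\mathrm{p}}\) Promotion steps followed by \(k - k_{\mathrm{p}}\) Suppression steps to a handful of singular values in \([0, 1]\). The split \(k_{\mathrm{p}} = 2\) and the sample values are assumptions chosen for illustration; this section does not state the value of \(k_{\mathrm{p}}\) used in the experiments.</p>

```python
import numpy as np

def f_p(s):
    """Promotion polynomial from the text."""
    return 1.875 * s - 1.25 * s**3 + 0.375 * s**5

def f_s(s):
    """Suppression polynomial from the text."""
    return 2.5 * s**3 - 1.5 * s**5

def high_pass(s, k_p, k=5):
    """Apply k_p Promotion steps, then k - k_p Suppression steps."""
    for _ in range(k_p):
        s = f_p(s)
    for _ in range(k - k_p):
        s = f_s(s)
    return s

sigma = np.array([0.02, 0.1, 0.3, 0.6, 0.9, 1.0])
print("input: ", sigma)
print("output:", np.round(high_pass(sigma, k_p=2), 4))   # k_p = 2 is an assumed split
```

  <p>A useful way to read the output: \(f_{\mathrm{s}}\) has an unstable interior fixed point at \(\sigma = \sqrt{2/3} \approx 0.816\) (solve \(2.5\sigma^2 - 1.5\sigma^4 = 1\)), so singular values that Promotion has lifted above that level drift toward 1 during Suppression, while the rest decay toward 0. Increasing \(k_{\mathrm{p}}\) therefore lets more of the spectrum through.</p>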

  <div id="fig-1-zh" class="fig-grid" style="grid-template-columns: repeat(4, 1fr);">
  <figure><img src="/images/blog/pion/iter_NS.png" alt="Muon NS" /><figcaption>(a) Muon NS</figcaption></figure>
  <figure><img src="/images/blog/pion/iter_P.png" alt="Promotion" /><figcaption>(b) Promotion \(f_{\mathrm{p}}\)</figcaption></figure>
  <figure><img src="/images/blog/pion/iter_S.png" alt="Suppression" /><figcaption>(c) Suppression \(f_{\mathrm{s}}\)</figcaption></figure>
  <figure><img src="/images/blog/pion/iter_pion_mix.png" alt="High pass NS" /><figcaption>(d) Pion high-pass NS</figcaption></figure>
</div>
  <p class="fig-caption">Figure 1. Visualization of \(f(\sigma)\) on \(\sigma \in [0, 1]\). Muon (a) lifts every singular value to 1. Pion composes Promotion (b) and Suppression (c) to obtain the high-pass shape in (d).</p>

  <h3 id="per-head--rlvr">Per-head mode (for RLVR)</h3>

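  <p>So far, high-pass NS treats each layer's momentum \(\mathbf{M}_t \in \mathbb{R}^{m \times n}\) as a single block, exactly as Muon does; we call this the <strong>default mode</strong>. We find, however, that it is not suitable for RLVR. RLVR starts from an already pretrained (or SFT-ed) LLM, whose attention heads differ markedly in \(\|\mathbf{W}_Q^h\|_F\), \(\|\mathbf{W}_K^h\|_F\), \(\|\mathbf{W}_V^h\|_F\), and \(\|\mathbf{W}_O^h\|_F\). These differences shape both the forward outputs and the backward gradients, so different heads should receive <strong>updates of different scales</strong>.</p>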

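  <p>Pion therefore offers a <strong>per-head mode</strong> alongside the default mode: first reshape the attention projection along the head dimension into per-head submatrices, then run the full two-stage high-pass NS independently on each submatrix. Formally, let \(H\) be the number of attention heads and \(d_k\) the per-head dimension; each attention projection (any of Q / K / V / O) admits a canonical reshape along the head axis:</p>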

\[\mathbf{M}_t \;\xrightarrow{\;\mathrm{Reshape}\;}\; \{\mathbf{M}_t^h\}_{h=1}^{H}, \qquad \mathbf{M}_t^h \in \mathbb{R}^{d \times d_k}.\]

  <p>Per-head mode runs the two-stage high-pass NS <strong>independently on each \(\mathbf{M}_t^h\)</strong>: first a per-head Frobenius pre-normalization \(\mathbf{X}^h \leftarrow \mathbf{M}_t^h / (\|\mathbf{M}_t^h\|_F + \epsilon)\), then \(k_{\mathrm{p}}\) Promotion steps followed by \(k_{\mathrm{s}}\) Suppression steps, and finally \(\{\mathbf{X}^h\}_{h=1}^H\) is reshaped back into the full \(\mathbf{X} \in \mathbb{R}^{m \times n}\). Because \(\mathbf{X}^h (\mathbf{X}^h)^\top \mathbf{X}^h\) is batched along the \(h\) dimension on the GPU anyway, the only extra cost of per-head mode over default mode is the reshape itself.</p>
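
  <p>A minimal NumPy sketch of the per-head procedure follows. It assumes the projection's momentum is stored as a \(d \times (H \cdot d_k)\) matrix whose heads are contiguous column blocks; real checkpoints may use a different layout (for example a transposed weight), in which case only the reshape changes. The function names, \(k_{\mathrm{p}} = 2\), and <code>eps</code> are illustrative assumptions.</p>

```python
import numpy as np

PROMOTE = (1.875, -1.25, 0.375)   # (a_p, b_p, c_p) from the text
SUPPRESS = (0.0, 2.5, -1.5)       # (a_s, b_s, c_s) from the text

def ns_step(x, coeffs):
    """One quintic NS step, batched over any leading axes."""
    a, b, c = coeffs
    gram = np.swapaxes(x, -1, -2) @ x            # X^T X for every head
    return a * x + b * (x @ gram) + c * (x @ gram @ gram)

def per_head_high_pass(m, num_heads, k_p=2, k=5, eps=1e-7):
    d, n = m.shape
    d_k = n // num_heads
    # Assumed layout: split the columns into heads, giving shape (H, d, d_k).
    x = m.reshape(d, num_heads, d_k).transpose(1, 0, 2)
    # Per-head Frobenius pre-normalization.
    x = x / (np.linalg.norm(x, axis=(1, 2), keepdims=True) + eps)
    for _ in range(k_p):
        x = ns_step(x, PROMOTE)
    for _ in range(k - k_p):
        x = ns_step(x, SUPPRESS)
    # Inverse reshape back to (d, H * d_k).
    return x.transpose(1, 0, 2).reshape(d, n)

rng = np.random.default_rng(0)
momentum = rng.standard_normal((16, 8))          # toy example: d = 16, H = 2, d_k = 4
update = per_head_high_pass(momentum, num_heads=2)
print(update.shape)
```

  <p>Because every operation in <code>ns_step</code> broadcasts over the leading head axis, the loop body is the same as in the whole-matrix case, which is the sense in which the reshape is the only added cost.</p>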

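  <p><a href="#fig-4-zh">Figure 4</a>(a) illustrates the problem with default mode concretely on Qwen3 1.7B: the cross-head variance of the RLVR starting weights, \(\mathrm{Var}_h(\|\mathbf{W}_{0,Q}^h\|_F)\), is non-negligible in all 28 layers (top), yet under default-mode Pion the cross-head variance of the update, \(\mathrm{Var}_h(\|\mathbf{W}_{*,Q}^h - \mathbf{W}_{0,Q}^h\|_F)\), is almost zero (bottom). In other words, default mode flattens the update scale across heads and mixes the subdirections of different heads together, so every head receives a nearly identical update and the differences between heads are lost. Per-head mode, by contrast, restores an update structure that depends on the layer and varies across heads.</p>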
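
  <p>The statistic \(\mathrm{Var}_h(\|\cdot^h\|_F)\) is straightforward to compute for one projection matrix. The sketch below again assumes heads are contiguous column blocks of a \(d \times (H \cdot d_k)\) matrix; the helper name <code>cross_head_norm_variance</code> and the toy weights are hypothetical, not the script behind Figure 4.</p>

```python
import numpy as np

def cross_head_norm_variance(w, num_heads):
    """Variance over heads of the per-head Frobenius norm."""
    d, n = w.shape
    heads = w.reshape(d, num_heads, n // num_heads)
    norms = np.linalg.norm(heads, axis=(0, 2))   # one Frobenius norm per head
    return float(norms.var())

# Toy weights: two heads, the second three times larger than the first.
w0 = np.ones((2, 4))
w0[:, 2:] *= 3.0
print(cross_head_norm_variance(w0, num_heads=2))
```

  <p>To reproduce the two rows of the figure for a given layer, one would evaluate this on the initial weights \(\mathbf{W}_{0,Q}\) and on the difference \(\mathbf{W}_{*,Q} - \mathbf{W}_{0,Q}\).</p>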

  <div id="fig-4-zh" class="fig-grid wrap" style="grid-template-columns: repeat(2, 1fr);">
  <figure><img src="/images/blog/pion/qwen3_1p7b_Q_headnorm_variance_compact_step80.png" alt="Q headnorm variance" /><figcaption>(a) Cross-head variance of the Q projection</figcaption></figure>
  <figure><img src="/images/blog/pion/acc_mh.png" alt="Per head ablation accuracy" /><figcaption>(b) MATH500 accuracy</figcaption></figure>
</div>
  <p class="fig-caption">Figure 4. Effect of per-head high-pass NS on RLVR (Qwen3 1.7B, GRPO + MATH levels 3 to 5). (a) Cross-head variance of the Q projection: before RLVR, \(\mathrm{Var}_h(\|\mathbf{W}_{0,Q}^h\|_F)\) (top), versus the RLVR update, \(\mathrm{Var}_h(\|\mathbf{W}_{*,Q}^h - \mathbf{W}_{0,Q}^h\|_F)\), under default and per-head Pion (bottom). (b) MATH500 accuracy of AdamW, Muon (default vs per-head), and Pion (default vs per-head).</p>

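  <p>At this point, Pion's per-head high-pass NS comprises two design choices: the <strong>spectral high pass</strong> and the <strong>per-head reshape</strong>. A natural question is which of the two matters more. We find that they are complementary but <strong>not symmetric</strong>. The <strong>spectral high pass</strong> is the <em>primary</em> driver of Pion's gains: in <a href="#fig-4-zh">Figure 4</a>(b), even when Muon's NS is given the same reshape, the resulting per-head Muon still collapses completely, because the noise tail it injects head by head is just as fatal as the one it injects on the full matrix. The <strong>per-head reshape</strong> is an <em>auxiliary</em> mechanism that preserves the differences among attention heads in the pretrained (or SFT-ed) LLM.</p>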

  <h3 id="section-4">Pseudocode</h3>

  <p>Algorithm 1 is Muon's standard NS iteration; Algorithm 2 is the default mode of Pion's high-pass NS, and Algorithm 3 is the per-head mode. <span class="algo-legend-inline"><span class="lg-item"><span class="swatch s-1"></span> differences between Pion and Muon</span><span class="lg-item"><span class="swatch s-2"></span> per-head mode only</span></span></p>

  <div class="algo-row">
  <div class="algo-block compact">
    <div class="algo-header">
      <span class="algo-num">Algorithm 1</span>
      <span class="algo-title">Muon optimizer</span>
    </div>
    <div class="algo-require">
      <span class="kw">Require:</span>&nbsp; learning rate \(\eta\), momentum coefficient \(\mu\), number of NS steps \(k = 5\)
    </div>
    <ol class="algo-steps">
      <li><span class="line">\(\mathbf{M}_0 \leftarrow \mathbf{0}\)</span></li>
      <li><span class="line"><span class="kw">for</span> \(t = 1, 2, \dots\) <span class="kw">do</span></span></li>
      <li class="ind-1"><span class="line">\(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{X} \leftarrow \mathbf{M}_t / (\lVert \mathbf{M}_t \rVert_F + \epsilon)\)</span><span class="comment">Frobenius pre-normalization</span></li>
      <li class="ind-1"><span class="line"><span class="kw">for</span> \(i = 1, \dots, k\) <span class="kw">do</span></span><span class="comment">\((a, b, c) = (3.4445, -4.7750, 2.0315)\)</span></li>
      <li class="ind-2"><span class="line">\(\mathbf{X} \leftarrow a\mathbf{X} + b\mathbf{X}\mathbf{X}^\top\mathbf{X} + c\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)</span></li>
      <li class="ind-1"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1"><span class="line">\(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)</span></li>
      <li><span class="line"><span class="kw">end for</span></span></li>
      <li><span class="line"><span class="kw">return</span> \(\boldsymbol{\Theta}_t\)</span></li>
    </ol>
  </div>

  <div class="algo-block compact">
    <div class="algo-header">
      <span class="algo-num">Algorithm 2</span>
      <span class="algo-title">Pion optimizer (default mode: whole-matrix high-pass NS)</span>
    </div>
    <div class="algo-require">
      <span class="kw">Require:</span>&nbsp; learning rate \(\eta\), momentum coefficient \(\mu\), number of Promotion steps \(k_{\mathrm{p}}\)
    </div>
    <ol class="algo-steps">
      <li class="algo-diff"><span class="line">\(k_{\mathrm{s}} \leftarrow 5 - k_{\mathrm{p}}\)</span><span class="comment">split \(k = 5\) into \(k_{\mathrm{p}} + k_{\mathrm{s}}\)</span></li>
      <li><span class="line">\(\mathbf{M}_0 \leftarrow \mathbf{0}\)</span></li>
      <li><span class="line"><span class="kw">for</span> \(t = 1, 2, \dots\) <span class="kw">do</span></span></li>
      <li class="ind-1"><span class="line">\(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{X} \leftarrow \mathbf{M}_t / (\lVert \mathbf{M}_t \rVert_F + \epsilon)\)</span><span class="comment">Frobenius pre-normalization</span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">for</span> \(i = 1, \dots, k_{\mathrm{p}}\) <span class="kw">do</span></span><span class="comment">Stage 1: Promotion, \((a_{\mathrm{p}}, b_{\mathrm{p}}, c_{\mathrm{p}}) = (1.875, -1.25, 0.375)\)</span></li>
      <li class="ind-2 algo-diff"><span class="line">\(\mathbf{X} \leftarrow a_{\mathrm{p}}\mathbf{X} + b_{\mathrm{p}}\mathbf{X}\mathbf{X}^\top\mathbf{X} + c_{\mathrm{p}}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)</span></li>
      <li class="ind-1"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">for</span> \(j = 1, \dots, k_{\mathrm{s}}\) <span class="kw">do</span></span><span class="comment">Stage 2: Suppression, \((a_{\mathrm{s}}, b_{\mathrm{s}}, c_{\mathrm{s}}) = (0, 2.5, -1.5)\)</span></li>
      <li class="ind-2 algo-diff"><span class="line">\(\mathbf{X} \leftarrow a_{\mathrm{s}}\mathbf{X} + b_{\mathrm{s}}\mathbf{X}\mathbf{X}^\top\mathbf{X} + c_{\mathrm{s}}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^2\)</span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1"><span class="line">\(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)</span></li>
      <li><span class="line"><span class="kw">end for</span></span></li>
      <li><span class="line"><span class="kw">return</span> \(\boldsymbol{\Theta}_t\)</span></li>
    </ol>
  </div>

  <div class="algo-block compact">
    <div class="algo-header">
      <span class="algo-num">Algorithm 3</span>
      <span class="algo-title">Pion optimizer (per-head mode: high-pass NS per attention head)</span>
    </div>
    <div class="algo-require">
      <span class="kw">Require:</span>&nbsp; learning rate \(\eta\), momentum coefficient \(\mu\), number of Promotion steps \(k_{\mathrm{p}}\), number of heads \(H\)
    </div>
    <ol class="algo-steps">
      <li class="algo-diff"><span class="line">\(k_{\mathrm{s}} \leftarrow 5 - k_{\mathrm{p}}\)</span><span class="comment">split \(k = 5\) into \(k_{\mathrm{p}} + k_{\mathrm{s}}\)</span></li>
      <li><span class="line">\(\mathbf{M}_0 \leftarrow \mathbf{0}\)</span></li>
      <li><span class="line"><span class="kw">for</span> \(t = 1, 2, \dots\) <span class="kw">do</span></span></li>
      <li class="ind-1"><span class="line">\(\mathbf{G}_t \leftarrow \nabla_{\boldsymbol{\Theta}} \mathcal{L}_t(\boldsymbol{\Theta}_{t-1})\)</span></li>
      <li class="ind-1"><span class="line">\(\mathbf{M}_t \leftarrow \mu\, \mathbf{M}_{t-1} + \mathbf{G}_t\)</span></li>
      <li class="ind-1 algo-diff-2"><span class="line">\(\{\mathbf{M}_t^h\}_{h=1}^{H} \leftarrow \mathrm{Reshape}(\mathbf{M}_t)\)</span><span class="comment">split along the head dimension</span></li>
      <li class="ind-1 algo-diff-2"><span class="line">\(\mathbf{X}^h \leftarrow \mathbf{M}_t^h / (\lVert \mathbf{M}_t^h \rVert_F + \epsilon),\ \forall\, h\)</span><span class="comment">per-head pre-normalization</span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">for</span> \(i = 1, \dots, k_{\mathrm{p}}\) <span class="kw">do</span></span><span class="comment">Stage 1: Promotion, batched over \(H\)</span></li>
      <li class="ind-2 algo-diff"><span class="line">\(\mathbf{X}^h \leftarrow a_{\mathrm{p}}\mathbf{X}^h + b_{\mathrm{p}}\mathbf{X}^h(\mathbf{X}^h)^\top\mathbf{X}^h + c_{\mathrm{p}}\mathbf{X}^h\bigl((\mathbf{X}^h)^\top\mathbf{X}^h\bigr)^2\)</span></li>
      <li class="ind-1"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">for</span> \(j = 1, \dots, k_{\mathrm{s}}\) <span class="kw">do</span></span><span class="comment">Stage 2: Suppression, batched over \(H\)</span></li>
      <li class="ind-2 algo-diff"><span class="line">\(\mathbf{X}^h \leftarrow a_{\mathrm{s}}\mathbf{X}^h + b_{\mathrm{s}}\mathbf{X}^h(\mathbf{X}^h)^\top\mathbf{X}^h + c_{\mathrm{s}}\mathbf{X}^h\bigl((\mathbf{X}^h)^\top\mathbf{X}^h\bigr)^2\)</span></li>
      <li class="ind-1 algo-diff"><span class="line"><span class="kw">end for</span></span></li>
      <li class="ind-1 algo-diff-2"><span class="line">\(\mathbf{X} \leftarrow \mathrm{Reshape}^{-1}(\{\mathbf{X}^h\}_{h=1}^{H})\)</span><span class="comment">merge the per-head matrices</span></li>
      <li class="ind-1"><span class="line">\(\boldsymbol{\Theta}_t \leftarrow \boldsymbol{\Theta}_{t-1} - \eta\, \mathbf{X}\)</span></li>
      <li><span class="line"><span class="kw">end for</span></span></li>
      <li><span class="line"><span class="kw">return</span> \(\boldsymbol{\Theta}_t\)</span></li>
    </ol>
  </div>
</div>
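
  <p>The inner loops of Algorithms 1 and 2 can be compared side by side in a few lines of NumPy. The sketch below is a float64 illustration, not the optimizer implementation: it omits momentum and the parameter update, and the split \(k_{\mathrm{p}} = 2\), the <code>eps</code> value, and the synthetic spectrum (one dominant direction plus a weak tail) are assumptions made for the example.</p>

```python
import numpy as np

MUON = (3.4445, -4.7750, 2.0315)   # Algorithm 1 coefficients
PROMOTE = (1.875, -1.25, 0.375)    # Algorithm 2, stage 1
SUPPRESS = (0.0, 2.5, -1.5)        # Algorithm 2, stage 2

def ns_step(x, coeffs):
    a, b, c = coeffs
    gram = x.T @ x
    return a * x + b * (x @ gram) + c * (x @ gram @ gram)

def muon_ns(m, k=5, eps=1e-7):
    x = m / (np.linalg.norm(m) + eps)          # Frobenius pre-normalization
    for _ in range(k):
        x = ns_step(x, MUON)
    return x

def pion_ns(m, k_p=2, k=5, eps=1e-7):
    x = m / (np.linalg.norm(m) + eps)          # same pre-normalization as Muon
    for _ in range(k_p):
        x = ns_step(x, PROMOTE)
    for _ in range(k - k_p):
        x = ns_step(x, SUPPRESS)
    return x

# Synthetic momentum with prescribed singular values.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((32, 8)))
v, _ = np.linalg.qr(rng.standard_normal((16, 8)))
spectrum = np.array([1.0, 0.05, 0.04, 0.03, 0.02, 0.01, 0.01, 0.01])
m = (u * spectrum) @ v.T

print("Muon:", np.round(np.linalg.svd(muon_ns(m), compute_uv=False)[:8], 3))
print("Pion:", np.round(np.linalg.svd(pion_ns(m), compute_uv=False)[:8], 3))
```

  <p>Printing the singular values of the two outputs makes the contrast in the text visible on a toy case: Muon's polynomial pulls the weak tail up to the same order as the leading direction, while the two-stage map keeps the leading direction near 1 and sends the tail toward 0. Both functions perform five NS steps with the same matrix products, which is the sense in which the per-step cost is unchanged.</p>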


  <h2 id="section-5">Experiments</h2>

  <p>We validate Pion in two settings:</p>

  <ul>
    <li><strong>VLA training</strong>: two architectures, the \(\ell_1\)-regression-based VLA Adapter and the flow-matching-based VLANeXt, with LIBERO and LIBERO Plus as benchmarks;</li>
    <li><strong>RLVR post-training</strong>: GRPO and GMPO as algorithms; Qwen3 1.7B and Qwen3 4B as models; MATH and GSM8K as benchmarks.</li>
  </ul>

  <h3 id="vla-2">VLA</h3>

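  <p><strong>VLA Adapter on LIBERO.</strong> We first compare AdamW, Muon, and Pion on VLA Adapter: on the four LIBERO suites (Object, Spatial, Goal, Long), each suite uses a fixed training budget (1,500 steps for Object, 15,000 steps for the others), and we additionally report a finer-grained training curve on Object.</p>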

  <div id="fig-5-zh" class="fig-grid" style="grid-template-columns: minmax(0, 2fr) minmax(0, 1fr); max-width: 90%; margin: 0.6em auto 0.05em; padding: 2px 10px 2px; row-gap: 0;">
  <div class="fig-legend" style="margin-top: 0; margin-bottom: -10px;"><img src="/images/blog/pion/vlaadapter_legend.png" alt="Legend: AdamW, Muon, Pion" /></div>
  <figure style="min-width: 0; height: auto; justify-content: flex-start; margin: 0;"><img src="/images/blog/pion/vlaadapter_libero.png" alt="LIBERO four tasks" style="width: 100%; height: auto; max-height: none; max-width: 100%; display: block; margin: 0;" /><figcaption style="white-space: normal; margin: 0;">(a) Success rate on LIBERO</figcaption></figure>
  <figure style="min-width: 0; height: auto; justify-content: flex-start; margin: 0;"><img src="/images/blog/pion/vlaadapter_training.png" alt="Object training curve" style="width: 100%; height: auto; max-height: none; max-width: 100%; display: block; margin: 0;" /><figcaption style="white-space: normal; margin: 0;">(b) Success rate vs training steps on Object</figcaption></figure>
</div>
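  <p class="fig-caption">Figure 5. AdamW, Muon, and Pion for VLA Adapter on LIBERO. (a) Success rate on the four suites LIBERO Object, Spatial, Goal, and Long, each with a fixed training budget (1,500 steps for Object, 15,000 steps for the others). (b) Success rate vs training steps on LIBERO Object.</p>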

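  <p><a href="#fig-5-zh">Figure 5</a>(a) shows that Pion outperforms both Muon and AdamW on every suite. <a href="#fig-5-zh">Figure 5</a>(b) further gives the training curve on LIBERO Object: Pion reaches a 95.4% success rate at 500 steps and saturates at 100% by 1,500 steps, whereas AdamW needs noticeably more steps to get close. The spectral high pass thus substantially lowers the training cost of reaching a high success rate.</p>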

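  <p>On VLANeXt (flow matching), Pion not only achieves the best success rate on LIBERO but also keeps its lead on the more challenging LIBERO Plus, most notably under the language (about \(+9\) points over Muon), noise (about \(+6\)), and robot (about \(+6\)) perturbations; see <a href="#tab-1-zh">Table 1</a>. This matches our earlier diagnosis: uniform whitening over-amplifies noise directions that do not generalize.</p>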

  <div id="tab-1-zh" class="results-table">

    <table>
      <thead>
        <tr>
          <th style="text-align: left">Optimizer</th>
          <th style="text-align: center">LIBERO</th>
          <th style="text-align: center">LIBERO Plus</th>
          <th style="text-align: center">Background</th>
          <th style="text-align: center">Camera</th>
          <th style="text-align: center">Language</th>
          <th style="text-align: center">Layout</th>
          <th style="text-align: center">Light</th>
          <th style="text-align: center">Noise</th>
          <th style="text-align: center">Robot</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td style="text-align: left">AdamW</td>
          <td style="text-align: center">79.45</td>
          <td style="text-align: center">64.57</td>
          <td style="text-align: center">68.97</td>
          <td style="text-align: center">70.38</td>
          <td style="text-align: center">54.50</td>
          <td style="text-align: center">61.80</td>
          <td style="text-align: center">76.35</td>
          <td style="text-align: center">66.37</td>
          <td style="text-align: center">47.04</td>
        </tr>
        <tr>
          <td style="text-align: left">Muon</td>
          <td style="text-align: center">93.65</td>
          <td style="text-align: center">72.34</td>
          <td style="text-align: center">82.72</td>
          <td style="text-align: center">68.00</td>
          <td style="text-align: center">77.53</td>
          <td style="text-align: center">76.21</td>
          <td style="text-align: center">86.17</td>
          <td style="text-align: center">69.98</td>
          <td style="text-align: center">57.36</td>
        </tr>
        <tr>
          <td style="text-align: left"><strong>Pion (Ours)</strong></td>
          <td style="text-align: center"><strong>96.35</strong></td>
          <td style="text-align: center"><strong>75.93</strong></td>
          <td style="text-align: center"><strong>84.53</strong></td>
          <td style="text-align: center"><strong>70.88</strong></td>
          <td style="text-align: center"><strong>86.93</strong></td>
          <td style="text-align: center"><strong>76.71</strong></td>
          <td style="text-align: center"><strong>90.67</strong></td>
          <td style="text-align: center"><strong>76.09</strong></td>
          <td style="text-align: center"><strong>63.18</strong></td>
        </tr>
      </tbody>
    </table>

  </div>
  <p class="fig-caption">Table 1. AdamW, Muon, and Pion for VLANeXt on LIBERO and LIBERO Plus. The best result in each column is in <strong>bold</strong>.</p>

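  <p>To make the gap on LIBERO Plus more tangible, we roll out the VLANeXt policies trained with each of the three optimizers once on the same LIBERO Plus episode. AdamW and Muon fail during the grasping or placing phase, whereas Pion completes the task.</p>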

  <div id="video-1-zh" class="fig-grid" style="grid-template-columns: repeat(3, 1fr);">
  <figure><video src="/videos/blog/pion/ep1373_AdamW.mp4" autoplay="" loop="" muted="" playsinline=""></video><figcaption>(a) AdamW</figcaption></figure>
  <figure><video src="/videos/blog/pion/ep1373_Muon.mp4" autoplay="" loop="" muted="" playsinline=""></video><figcaption>(b) Muon</figcaption></figure>
  <figure><video src="/videos/blog/pion/ep1373_Pion.mp4" autoplay="" loop="" muted="" playsinline=""></video><figcaption>(c) Pion (Ours)</figcaption></figure>
</div>
  <p class="fig-caption">Video 1. Rollout trajectories on the same LIBERO Plus episode (ep1373) for the VLANeXt policies trained with AdamW, Muon, and Pion.</p>

  <h3 id="rlvr-2">RLVR</h3>

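  <p>Across all 8 RLVR settings (GRPO/GMPO × Qwen3 1.7B/4B × MATH/GSM8K; see <a href="#fig-6-zh">Figure 6</a>), Muon collapses to near-zero accuracy without exception; Pion not only recovers a meaningful training signal but also converges faster than AdamW.</p>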

  <div id="fig-6-zh" class="fig-grid" style="grid-template-columns: repeat(4, 1fr);">
  <figure><img src="/images/blog/pion/rl_val_core_grpo_math3-5_qwen3_1.7b.png" alt="" /><figcaption>(a) GRPO, 1.7B, MATH</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_grpo_math3-5_qwen3_4b.png" alt="" /><figcaption>(b) GRPO, 4B, MATH</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_grpo_gsm8k_qwen3_1.7b.png" alt="" /><figcaption>(c) GRPO, 1.7B, GSM8K</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_grpo_gsm8k_qwen3_4b.png" alt="" /><figcaption>(d) GRPO, 4B, GSM8K</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_gmpo_math3-5_qwen3_1.7b.png" alt="" /><figcaption>(e) GMPO, 1.7B, MATH</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_gmpo_math3-5_qwen3_4b.png" alt="" /><figcaption>(f) GMPO, 4B, MATH</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_gmpo_gsm8k_qwen3_1.7b.png" alt="" /><figcaption>(g) GMPO, 1.7B, GSM8K</figcaption></figure>
  <figure><img src="/images/blog/pion/rl_val_core_gmpo_gsm8k_qwen3_4b.png" alt="" /><figcaption>(h) GMPO, 4B, GSM8K</figcaption></figure>
</div>
  <p class="fig-caption">Figure 6. Validation accuracy vs training steps for AdamW, Muon, and Pion on RLVR: 8 settings (two algorithms × two model sizes × two benchmarks).</p>

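  <p><strong>Reverse ablation: the direction is what matters.</strong> To confirm that the gain truly comes from the “high-pass” direction, we construct <strong>Low-pass Muon (LPMuon)</strong>: it shares Pion's NS structure and per-step cost but flips the filter direction to low pass. LPMuon fails to train at all, with accuracy stuck at the level of the initial checkpoint; see <a href="#fig-7-zh">Figure 7</a>.</p>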

  <div id="fig-7-zh" class="fig-grid wrap" style="grid-template-columns: repeat(2, 1fr);">
  <figure><img src="/images/blog/pion/pion0_muon_lowrankmuon.png" alt="Low pass profile" /><figcaption>(a) Low-pass scalar map</figcaption></figure>
  <figure><img src="/images/blog/pion/mhlpmuon.png" alt="GSM8K accuracy" /><figcaption>(b) GSM8K accuracy</figcaption></figure>
</div>
  <p class="fig-caption">Figure 7. (a) The scalar map \(f(\sigma)\) of LPMuon. (b) GSM8K accuracy of AdamW, Pion, and LPMuon (Qwen3 1.7B, GRPO).</p>

  <h2 id="bibtex-zh">BibTeX</h2>

  <pre class="bibtex-block"><code>@misc{fan2026rethinkingmuonpretrainingspectral,
      title={Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR}, 
      author={Chongyu Fan and Gaowen Liu and Mingyi Hong and Ramana Rao Kompella and Sijia Liu},
      year={2026},
      eprint={2605.19282},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.19282}, 
}
</code></pre>

</div>]]></content><author><name>Chongyu Fan</name></author><category term="optimizer" /><category term="vla" /><category term="rlvr" /><summary type="html"><![CDATA[Muon orthogonalizes the momentum matrix and pushes every singular value to one. This works beautifully for LLM pretraining, which is essentially next token classification on text via supervised learning. But what happens when we move along three orthogonal axes: a different modality, a different loss, or a different learning paradigm? Pion is a drop in replacement for Muon's Newton Schulz iteration that fixes the spectral mismatch we observe along all three axes.]]></summary></entry></feed>