Blog posts

2026

UP: Unbounded Positive Asymmetric Optimization for Breaking the Exploration-Stability Dilemma

Published: July 07, 2026

Importance-sampling (IS) based RL for LLM reasoning faces an exploration-stability dilemma: pure importance sampling explodes, and the clipping used to tame it ties a token’s update budget to the old policy. By formalizing the Probability Capacity and anchoring the policy to itself with a stop-gradient, Unbounded Positive Asymmetric Optimization (UP) unleashes stable, unclipped gradients for correct rollouts while keeping clipping as a safeguard for wrong ones. UP is a plug-and-play objective that improves DAPO, GRPO, and GSPO across dense, MoE, and vision-language models.

Rethinking Muon Beyond Pretraining: Spectral Failures and High Pass Remedies for VLA and RLVR

Published: May 15, 2026

Muon orthogonalizes the momentum matrix and pushes every singular value to one. This works beautifully for LLM pretraining, which is essentially next-token classification on text via supervised learning. But what happens when we move along three orthogonal axes: a different modality, a different loss, or a different learning paradigm? Pion is a drop-in replacement for Muon’s Newton-Schulz iteration that fixes the spectral mismatch we observe along all three axes.

Chongyu Fan
