Soft Adaptive Policy Optimisation (SAPO): Qwen’s New Standard for Stable RL in Large Language Models

Reinforcement learning (RL) is essential for improving the performance of large language models and multimodal systems, enabling more accurate reasoning, coding, and decision-making. Traditional RL methods, however, have long suffered from stability problems when applied to these models.

To tackle the long-standing stability problems, the Qwen research team introduced Soft Adaptive Policy Optimisation (SAPO) in November 2025. SAPO provides a smoother, more flexible alternative to hard-clipped RL methods, enabling longer, steadier training runs and enhanced performance across reasoning and multimodal tasks.

This article explains how Soft Adaptive Policy Optimisation (SAPO) is implemented, the issues it addresses relative to previous RL strategies, and how it represents a significant leap forward in model alignment and RL-related optimisation for LLMs.

Why Does Traditional RL Struggle With Large Language Models?

1. Token-level instability

Most RL methods used in LLM training operate at the token level, yet rewards from human feedback, and even evaluation metrics, arrive at the sequence level. This mismatch creates noisy gradients because individual tokens receive credit or blame they do not deserve.

2. A wide range of importance ratios

LLMs produce long sequences, and the probability of each token under the new policy can differ significantly from its probability under the old policy. The resulting importance-ratio spikes create unstable gradients, particularly in MoE models, where routing decisions add further variance.
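For reference, the token-level importance ratio that PPO-style methods work with is the probability of a token under the current policy divided by its probability under the policy that generated the sample:

\[
r_t(\theta) = \frac{\pi_\theta(y_t \mid x,\, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x,\, y_{<t})}
\]

Over a long sequence, even a handful of tokens with ratios far from 1 can dominate the update.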

3. Hard clipping is brittle

PPO and PPO-derived methods rely on clipping the importance ratio. Although clipping prevents divergence, it has two disadvantages:

  • Gradients can vanish abruptly.
  • Once a token's ratio crosses the clipping boundary, its learning signal is discarded entirely.

The result is unreliable updates and inefficient training.
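For context, the standard PPO clipped surrogate shows where the hard cutoff comes from:

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]
\]

Whenever the clipped branch is the one selected by the min, the objective no longer depends on \(\theta\) through that token, so its gradient is exactly zero, which is the abrupt signal loss described above.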

From GRPO and GSPO to SAPO: The Evolution of Stable RL

Before SAPO, Qwen introduced two essential methods:

GRPO (Group Relative Policy Optimisation)

GRPO normalises advantages across a group of responses sampled for each prompt. It improves stability but still relies on token-level importance ratios, leaving high variance unresolved.
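Concretely, GRPO computes a group-relative advantage for the i-th of G responses sampled for the same prompt from the group's rewards R_1, ..., R_G:

\[
\hat{A}_i = \frac{R_i - \operatorname{mean}(R_1,\ldots,R_G)}{\operatorname{std}(R_1,\ldots,R_G)}
\]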

GSPO (Group Sequence Policy Optimisation)

GSPO computes importance ratios at the sequence level. This aligns optimisation more naturally with the structure of the rewards and dramatically reduces variance. However, GSPO still relies on hard clipping, which can discard gradients when sequences cross the clipping boundaries.
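GSPO's ratio is, roughly, the length-normalised likelihood ratio of the whole response, i.e. the geometric mean of the token-level ratios:

\[
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|}
\]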

SAPO improves on both

SAPO retains token-level flexibility and preserves sequence-level coherence; however, it replaces the brittle clipping step with soft gating, avoiding abrupt gradient cutoffs.

What Is SAPO and How Does It Work?

Soft Adaptive Policy Optimisation introduces a gating mechanism that continuously scales gradients rather than abruptly cutting them off.

1. Soft, Temperature-Controlled Gate

SAPO computes the importance ratio between the new and old policies. Instead of checking whether it exceeds a fixed threshold, SAPO passes this ratio through a soft gate, typically a sigmoid-like function.

This gate (see the code sketch after this list):

  • Smoothly reduces the magnitude of the gradient when ratios diverge from 1
  • Prevents sudden drops in gradient flow
  • Allows small but important updates, even for off-policy tokens
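The exact gate SAPO uses is not reproduced here; the snippet below is a minimal sketch of the idea, assuming a sigmoid-shaped weight that peaks at a ratio of 1 and decays smoothly as the ratio drifts away. The function name soft_gate and the temperature tau are illustrative choices, not SAPO's actual API.

```python
import torch

def soft_gate(ratio: torch.Tensor, tau: float) -> torch.Tensor:
    """Sigmoid-shaped gate: close to 1 when ratio is near 1, decaying smoothly.

    Hypothetical form for illustration; the published SAPO gate may differ.
    """
    z = (ratio - 1.0) / tau
    # 4 * sigmoid(z) * sigmoid(-z) equals 1 at z = 0 and falls off smoothly,
    # so gradients are scaled down rather than cut to zero.
    return 4.0 * torch.sigmoid(z) * torch.sigmoid(-z)
```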

2. Asymmetric Temperatures

SAPO uses different temperatures for positive and negative advantages. This allows the optimiser to penalise undesirable actions more harshly, while allowing good actions to adjust more quickly.

This asymmetry dramatically enhances the stability of MoE models, where token-level variance is naturally high. 
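Building on the sketch above, the asymmetry can be expressed by selecting the temperature per token from the sign of the advantage. The names tau_pos and tau_neg and their values are placeholders, not SAPO's published hyperparameters.

```python
def gated_surrogate(ratio: torch.Tensor,
                    advantage: torch.Tensor,
                    tau_pos: float = 0.5,
                    tau_neg: float = 0.3) -> torch.Tensor:
    """Per-token gated policy-gradient surrogate (illustrative sketch only)."""
    # Different temperatures for positive and negative advantages control how
    # quickly the gate closes on each side as the ratio drifts from 1.
    tau = torch.where(advantage > 0,
                      torch.full_like(advantage, tau_pos),
                      torch.full_like(advantage, tau_neg))
    z = (ratio - 1.0) / tau
    gate = 4.0 * torch.sigmoid(z) * torch.sigmoid(-z)
    # Scale the usual ratio * advantage term instead of clipping it.
    return -(gate * ratio * advantage).mean()
```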

3. Sequence-Level Coherence + Token-Level Adaptivity

SAPO combines the advantages of GSPO and GRPO:

  • Sequence-level coherence: when token ratios stay near 1, SAPO behaves like a trust region over the sequence, stabilising updates.
  • Token-level adaptivity: tokens that diverge can still contribute useful gradient information instead of choking off the learning signal for the whole sequence.

This two-sided behaviour improves both quality and stability.
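A quick numerical check of the hypothetical soft_gate sketch above illustrates both behaviours: ratios near 1 pass through almost untouched (sequence-level coherence), while a single divergent token is down-weighted rather than zeroed out (token-level adaptivity).

```python
ratios = torch.tensor([1.02, 0.98, 1.01, 1.60])  # last token is far off-policy
print(soft_gate(ratios, tau=0.5))
# Gates stay near 1.0 for the first three tokens; the divergent token gets a
# reduced but non-zero gate, so its gradient is scaled, not discarded.
```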

Why SAPO Matters: Key Benefits

1. More Stable RL Runs

The smooth gate minimises both gradient spikes and gradient collapses. Qwen reports significantly longer, more stable RL runs, notably with large models and MoE architectures.

2. Improved Sample Efficiency

Unlike clipped methods, which discard large portions of the gradient signal, SAPO retains partial updates, so every sample contributes to training. This makes the pipeline more sample-efficient.

3. Higher Pass@1 and Task Accuracy

On math, coding, and multimodal benchmarks, SAPO improves direct accuracy metrics such as Pass@1. These improvements are reported across the Qwen LLM and Qwen-VL families.

4. Designed for Modern Architectures

Modern LLMs are becoming:

  • deeper
  • wider
  • more expert-routed
  • more multimodal

SAPO directly addresses the issues caused by these traits, making it future-proof.

Practical Implications for LLM Developers

Easier Hyperparameter Tuning

Hard clipping typically requires precise, per-model thresholds. SAPO's temperature parameters are softer, more forgiving, and much easier to adjust.

Better Fit for MoE Models

Because MoE routing increases token-probability variance, SAPO's soft gating is well suited to stabilising gradients.

Drop-in Compatibility

SAPO integrates into standard RLHF pipelines without significant design changes. It is compatible with widely used sequence-sampling, reward-normalisation, and preference-modelling configurations.
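As a rough illustration of the drop-in claim, swapping a clipped PPO-style loss for a gated one changes only the surrogate term; the functions below continue the hypothetical sketch from earlier and do not refer to any specific library's API.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Hard clipping: the gradient vanishes once the clipped branch is selected.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.minimum(ratio * advantage, clipped * advantage).mean()

def sapo_style_loss(ratio, advantage, tau_pos=0.5, tau_neg=0.3):
    # Soft gating: same inputs and call site, no hard clipping threshold.
    return gated_surrogate(ratio, advantage, tau_pos, tau_neg)
```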

More Compute-Efficient RL

Because SAPO uses gradients more efficiently, developers can achieve the same performance with fewer RL steps or a smaller sampling budget.

Soft Adaptive Policy Optimisation: Limitations and Open Challenges

Even though SAPO is a significant improvement, it’s not a complete solution:

  • Temperature Tuning Plays a Role: Extreme temperature settings can hinder training, and excessively low temperatures can reintroduce instability.
  • Long Sequences Increase the Computational Burden: Token-level gating incurs per-token overhead.
  • Generalisation Beyond Math and Coding: Multimodal (VL) tasks need broader evaluation; initial results are promising, but more testing is required.
  • Reward-Model Quality Still Limits Performance: SAPO stabilises training but cannot correct reward misalignment or weak evaluation signals.

Final Thoughts

Soft Adaptive Policy Optimisation (SAPO), developed by Qwen’s research team, is one of the most significant recent advances in RL-based LLM optimisation. By replacing brittle clipping with a temperature-controlled, continuous gating mechanism, SAPO delivers more stable training, better sample efficiency, and stronger performance across reasoning and multimodal tasks.

As LLM architectures become more complex, particularly with the growth of MoE and multimodal systems, SAPO provides a scalable platform for next-generation RL refinement. Researchers and engineers seeking more accurate alignment and optimisation will find that SAPO is among the most significant RL developments for LLMs in recent years.

Frequently Asked Questions

1. Who developed Soft Adaptive Policy Optimisation (SAPO)?

SAPO was introduced by the Qwen research team, the developers of the Qwen and Qwen-VL model families.

2. How is SAPO different from PPO and GRPO?

SAPO replaces hard clipping with a soft, temperature-controlled gate that avoids abrupt gradient suppression, improving stability and sample efficiency.

3. Is SAPO useful for MoE models?

Yes. SAPO is especially effective for MoE architectures, where token-level probability variance is higher and traditional clipping is harder to apply effectively.

4. Does SAPO increase the cost of computing?

Slightly. Gating at the token level adds computation per token. However, the stability improvement can reduce the number of wasted training samples and overall costs.

5. Can SAPO be used with conventional RLHF pipelines?

Yes. SAPO is fully compatible with reward modelling, preference learning, multi-sample evaluation, and group-based RL workflows.

6. What tasks see the most significant improvement with SAPO?

Qwen reports the largest gains in math, coding, and multimodal reasoning, especially on Pass@1 metrics.
