# Qwen3-1.7B-GameRL-AdaptiveCurriculum

LoRA adapters for Qwen3-1.7B-Base, trained via multi-game self-play RL on TextArena environments (Kuhn Poker, TicTacToe, Simple Negotiation), comparing three game-sampling strategies: uniform random (E2), inverse win rate (E3), and learning progress (E4).

Author: Jinyu Xiang

Code: github.com/XiangJinyu/textarena-rl

## Checkpoints

| Directory | Strategy | Steps | Best Result |
|---|---|---|---|
| `./` (root) | E3: Inverse Win Rate | 199 | GSM8K 78% (ckpt-100) |
| `e2-uniform/` | E2: Uniform Random | 199 (effective) | KP 37.5%, GSM8K 74% |
| `e4-learning-progress/` | E4: Learning Progress | 199 | TTT 65%, MATH500 53% |

## Results

### Game Win Rates (vs Gemini-2.0-Flash-Lite)

| Game | Base | E2 (Uniform) | E3 (Inv. WR) | E4 (Learn. Prog.) |
|---|---|---|---|---|
| KuhnPoker | 26.2% | 37.5% | 27.6% | 23.5% |
| TicTacToe | 41.5% | 61.5% | 56.8% | 65.0% |
| SimpleNeg. | 21.9% | 13.3% | 22.3% | 25.9% |

### Math Benchmark Transfer (no math training data)

| Benchmark | Base | E2 (best) | E3 (best) | E4 (best) |
|---|---|---|---|---|
| GSM8K | 61% | 74% (+13) | 78% (+17) | 74% (+13) |
| MATH500 | 41% | 49% (+8) | 48% (+7) | 53% (+12) |

## Key Findings

  1. Adaptive > Uniform for balance: Uniform (E2) excels on structured games but significantly degrades on negotiation (-8.6pp). Adaptive strategies maintain all games at or above base level.
  2. Learning Progress (E4) is best overall: Highest TTT win rate (65.0%, +23.5pp), strongest MATH500 transfer (+12pp), and monotonically improving performance with no late-stage regression.
  3. Inverse Win Rate (E3) peaks earlier: Best GSM8K (78%, +17pp) at mid-training but shows mild late regression.
  4. Game self-play transfers to math: Up to +17pp GSM8K and +12pp MATH500 without any math training data, matching SPIRAL's results with a 2.3x smaller model.
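The three game-sampling strategies compared above can be sketched as weighted sampling over the game pool. The weighting formulas and smoothing constant below are illustrative assumptions, not the repo's exact implementation:

```python
import random

def sample_game(win_rates, strategy, prev_win_rates=None):
    """Pick the next self-play game (illustrative sketch).

    win_rates:      dict mapping game name -> current win rate in [0, 1]
    prev_win_rates: win rates from the previous evaluation, used only by
                    the learning-progress strategy
    """
    games = list(win_rates)
    if strategy == "uniform":              # E2: every game equally likely
        weights = [1.0 for _ in games]
    elif strategy == "inverse_win_rate":   # E3: oversample games we lose
        weights = [1.0 - win_rates[g] + 1e-3 for g in games]
    elif strategy == "learning_progress":  # E4: oversample games whose win
        # rate is changing fastest between evaluations
        weights = [abs(win_rates[g] - prev_win_rates[g]) + 1e-3 for g in games]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return random.choices(games, weights=weights, k=1)[0]
```

Under this view, E3 keeps pushing on the hardest opponent, while E4 shifts effort toward whichever game is currently yielding the most improvement, which matches the late-training behavior described above.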

## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B-Base")

# E4 (Learning Progress) — best overall
model_e4 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum",
    subfolder="e4-learning-progress"
)

# E3 (Inverse Win Rate) — best GSM8K
model_e3 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum"
)

# E2 (Uniform) — baseline comparison
model_e2 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum",
    subfolder="e2-uniform"
)
```

## Training Details

- Base model: Qwen/Qwen3-1.7B-Base
- Method: REINFORCE with Role-conditioned Advantage Estimation (RAE), LoRA rank=32, alpha=32
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Games: Kuhn Poker + TicTacToe + Simple Negotiation (TextArena)
- Framework: UnstableBaselines (`pip install unstable-rl`)
- Rollouts per step: 128 (E3/E4), 64 (E2)
- Learning rate: 1e-5
- Training steps: 200
- vLLM temperature: 0.6
- Max generation length: 4096 tokens
- Hardware: 3xA100 PCIe 80GB (E3/E4), 2xH100 (E2)
- Training time: ~3.3h (E3/E4)
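Role-conditioned Advantage Estimation (RAE) centers each trajectory's reward against a baseline tracked separately per player role, so asymmetries like the first-mover advantage in TicTacToe do not leak into the policy gradient. A minimal sketch of the role-conditioning idea (the EMA baseline, decay value, and zero initialization are assumptions, not the repo's exact update rule):

```python
def rae_advantages(rewards, roles, baselines, decay=0.95):
    """Compute REINFORCE advantages with a per-role running baseline.

    rewards:   final game rewards, one per trajectory
    roles:     which role (e.g. player 0 / player 1) produced each trajectory
    baselines: dict mapping role -> exponential moving average of that
               role's past rewards (mutated in place)
    """
    advantages = []
    for reward, role in zip(rewards, roles):
        baseline = baselines.get(role, 0.0)
        advantages.append(reward - baseline)  # center reward within the role
        baselines[role] = decay * baseline + (1 - decay) * reward
    return advantages
```

The advantages would then weight the log-probabilities of the sampled actions in the usual REINFORCE loss.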

## Citation

```bibtex
@article{xiang2026adaptive-game-curriculum,
  title={Adaptive Game Curriculum for Multi-Turn Self-Play Reinforcement Learning in Language Models},
  author={Xiang, Jinyu},
  year={2026},
  url={https://github.com/XiangJinyu/textarena-rl}
}
```