# Qwen3-1.7B-GameRL-AdaptiveCurriculum
LoRA adapters trained via multi-game self-play RL on TextArena environments (Kuhn Poker, TicTacToe, Simple Negotiation) with different game sampling strategies.
**Author:** Jinyu Xiang
**Code:** [github.com/XiangJinyu/textarena-rl](https://github.com/XiangJinyu/textarena-rl)
## Checkpoints
| Directory | Strategy | Steps | Best Result |
|---|---|---|---|
| `./` (root) | E3: Inverse Win Rate | 199 | GSM8K 78% (ckpt-100) |
| `e2-uniform/` | E2: Uniform Random | 199 (effective) | KP 37.5%, GSM8K 74% |
| `e4-learning-progress/` | E4: Learning Progress | 199 | TTT 65%, MATH500 53% |
## Results

### Game Win Rates (vs Gemini-2.0-Flash-Lite)
| Game | Base | E2 (Uniform) | E3 (Inv. WR) | E4 (Learn. Prog.) |
|---|---|---|---|---|
| KuhnPoker | 26.2% | 37.5% | 27.6% | 23.5% |
| TicTacToe | 41.5% | 61.5% | 56.8% | 65.0% |
| SimpleNeg. | 21.9% | 13.3% | 22.3% | 25.9% |
### Math Benchmark Transfer (no math training data)
| Benchmark | Base | E2 (best) | E3 (best) | E4 (best) |
|---|---|---|---|---|
| GSM8K | 61% | 74% (+13) | 78% (+17) | 74% (+13) |
| MATH500 | 41% | 49% (+8) | 48% (+7) | 53% (+12) |
## Key Findings
- Adaptive > Uniform for balance: Uniform (E2) excels on the structured games but degrades sharply on negotiation (-8.6pp vs base). The adaptive strategies keep every game near or above the base level, with only a small dip on Kuhn Poker for E4.
- Learning Progress (E4) is best overall: Highest TTT win rate (65.0%, +23.5pp), strongest MATH500 transfer (+12pp), and monotonically improving performance with no late-stage regression.
- Inverse Win Rate (E3) peaks earlier: Best GSM8K (78%, +17pp) at mid-training but shows mild late regression.
- Game self-play transfers to math: Up to +17pp GSM8K and +12pp MATH500 without any math training data, matching SPIRAL's results with a 2.3x smaller model.
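The exact samplers are defined in the linked repo; as a rough sketch (the function names, epsilon term, and win-rate inputs below are illustrative assumptions, not the repo's API), the two adaptive strategies amount to:

```python
import random

def inverse_win_rate_probs(win_rates):
    """E3: draw games with probability proportional to (1 - win rate),
    so games the policy loses most often are sampled most."""
    weights = {g: 1.0 - wr for g, wr in win_rates.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def learning_progress_probs(prev_wr, curr_wr, eps=1e-3):
    """E4: draw games with probability proportional to the absolute change
    in win rate, favoring games where performance is still moving."""
    weights = {g: abs(curr_wr[g] - prev_wr[g]) + eps for g in curr_wr}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Example: base-model win rates from the table above
probs = inverse_win_rate_probs(
    {"KuhnPoker": 0.262, "TicTacToe": 0.415, "SimpleNegotiation": 0.219}
)
next_game = random.choices(list(probs), weights=list(probs.values()))[0]
```

Under the inverse-win-rate rule, Simple Negotiation (lowest base win rate) is drawn most often, which matches E3's ability to hold negotiation at base level while E2 lets it collapse.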
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B-Base")

# Note: PeftModel.from_pretrained injects adapter layers into `base` in place,
# so re-load the base model before attaching a different adapter.

# E4 (Learning Progress) — best overall
model_e4 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum",
    subfolder="e4-learning-progress",
)

# E3 (Inverse Win Rate) — best GSM8K; this adapter lives in the repo root
model_e3 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum",
)

# E2 (Uniform) — baseline comparison
model_e2 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum",
    subfolder="e2-uniform",
)
```
## Training Details
- Base model: Qwen/Qwen3-1.7B-Base
- Method: REINFORCE with Role-conditioned Advantage Estimation (RAE), LoRA rank=32, alpha=32
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Games: Kuhn Poker + TicTacToe + Simple Negotiation (TextArena)
- Framework: UnstableBaselines (`pip install unstable-rl`)
- Rollouts per step: 128 (E3/E4), 64 (E2)
- Learning rate: 1e-5
- Training steps: 200
- vLLM temperature: 0.6
- Max generation length: 4096 tokens
- Hardware: 3xA100 PCIe 80GB (E3/E4), 2xH100 (E2)
- Training time: ~3.3h (E3/E4)
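RAE's exact formulation is in the repo; a minimal sketch of a role-conditioned REINFORCE baseline (the class name, EMA update, and decay constant are illustrative assumptions) could look like:

```python
from collections import defaultdict

class RoleConditionedBaseline:
    """Keeps a running mean return per (game, role); the REINFORCE
    advantage is the episode return minus that role's baseline, so
    asymmetric roles (e.g. first vs second player) are not penalized
    for having different expected returns."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.baseline = defaultdict(float)
        self.seen = set()

    def advantage(self, game, role, episode_return):
        key = (game, role)
        if key not in self.seen:
            # First episode for this (game, role) seeds the baseline.
            self.seen.add(key)
            self.baseline[key] = episode_return
        else:
            # Exponential moving average of returns for this role.
            self.baseline[key] = (
                self.decay * self.baseline[key]
                + (1 - self.decay) * episode_return
            )
        return episode_return - self.baseline[key]
```

The per-role split is the point: in Kuhn Poker the two seats have different expected payoffs, so a shared baseline would systematically bias one role's gradient.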
## Citation

```bibtex
@article{xiang2026adaptive-game-curriculum,
  title={Adaptive Game Curriculum for Multi-Turn Self-Play Reinforcement Learning in Language Models},
  author={Xiang, Jinyu},
  year={2026},
  url={https://github.com/XiangJinyu/textarena-rl}
}
```