# Qwen3-1.7B-GameRL-AdaptiveCurriculum
LoRA adapters trained via multi-game self-play RL on TextArena environments (Kuhn Poker, TicTacToe, Simple Negotiation) with different game sampling strategies.
**Author:** Jinyu Xiang
**Code:** [github.com/XiangJinyu/textarena-rl](https://github.com/XiangJinyu/textarena-rl)
## Checkpoints
| Directory | Strategy | Steps | Best Result |
|---|---|---|---|
| `./` (root) | E3: Inverse Win Rate | 199 | GSM8K 78% (ckpt-100) |
| `e2-uniform/` | E2: Uniform Random | 199 (effective) | KP 37.5%, GSM8K 74% |
| `e4-learning-progress/` | E4: Learning Progress | 199 | TTT 65%, MATH500 53% |
## Results

### Game Win Rates (vs Gemini-2.0-Flash-Lite)
| Game | Base | E2 (Uniform) | E3 (Inv. WR) | E4 (Learn. Prog.) |
|---|---|---|---|---|
| KuhnPoker | 26.2% | 37.5% | 27.6% | 23.5% |
| TicTacToe | 41.5% | 61.5% | 56.8% | 65.0% |
| SimpleNeg. | 21.9% | 13.3% | 22.3% | 25.9% |
### Math Benchmark Transfer (no math training data)
| Benchmark | Base | E2 (best) | E3 (best) | E4 (best) |
|---|---|---|---|---|
| GSM8K | 61% | 74% (+13) | 78% (+17) | 74% (+13) |
| MATH500 | 41% | 49% (+8) | 48% (+7) | 53% (+12) |
## Key Findings
- Adaptive > Uniform for balance: Uniform (E2) excels on the structured games but degrades sharply on negotiation (-8.6pp vs base). The adaptive strategies keep every game near or above the base level, with only a small dip on Kuhn Poker for E4.
- Learning Progress (E4) is best overall: Highest TTT win rate (65.0%, +23.5pp), strongest MATH500 transfer (+12pp), and monotonically improving performance with no late-stage regression.
- Inverse Win Rate (E3) peaks earlier: Best GSM8K (78%, +17pp) at mid-training but shows mild late regression.
- Game self-play transfers to math: Up to +17pp GSM8K and +12pp MATH500 without any math training data, matching SPIRAL's results with a 2.3x smaller model.
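The exact samplers are defined in the linked repo; as a rough sketch (the function names, epsilon term, and win-rate inputs below are illustrative assumptions, not the repo's API), the two adaptive strategies amount to:

```python
import random

def inverse_win_rate_probs(win_rates):
    """E3: draw games with probability proportional to (1 - win rate),
    so games the policy loses most often are sampled most."""
    weights = {g: 1.0 - wr for g, wr in win_rates.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def learning_progress_probs(prev_wr, curr_wr, eps=1e-3):
    """E4: draw games with probability proportional to the absolute change
    in win rate, favoring games where performance is still moving."""
    weights = {g: abs(curr_wr[g] - prev_wr[g]) + eps for g in curr_wr}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Example: base-model win rates from the table above
probs = inverse_win_rate_probs(
    {"KuhnPoker": 0.262, "TicTacToe": 0.415, "SimpleNegotiation": 0.219}
)
next_game = random.choices(list(probs), weights=list(probs.values()))[0]
```

Under the inverse-win-rate rule, Simple Negotiation (lowest base win rate) is drawn most often, which matches E3's ability to hold negotiation at base level while E2 lets it collapse.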
## Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B-Base")

# Note: PeftModel.from_pretrained injects adapter layers into `base` in place,
# so re-load the base model before attaching a different adapter.

# E4 (Learning Progress) — best overall
model_e4 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum",
    subfolder="e4-learning-progress",
)

# E3 (Inverse Win Rate) — best GSM8K; this adapter lives in the repo root
model_e3 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum",
)

# E2 (Uniform) — baseline comparison
model_e2 = PeftModel.from_pretrained(
    base,
    "XiangJinYu/Qwen3-1.7B-GameRL-AdaptiveCurriculum",
    subfolder="e2-uniform",
)
```
## Training Details
- Base model: Qwen/Qwen3-1.7B-Base
- Method: REINFORCE with Role-conditioned Advantage Estimation (RAE), LoRA rank=32, alpha=32
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Games: Kuhn Poker + TicTacToe + Simple Negotiation (TextArena)
- Framework: UnstableBaselines (`pip install unstable-rl`)
- Rollouts per step: 128 (E3/E4), 64 (E2)
- Learning rate: 1e-5
- Training steps: 200
- vLLM temperature: 0.6
- Max generation length: 4096 tokens
- Hardware: 3xA100 PCIe 80GB (E3/E4), 2xH100 (E2)
- Training time: ~3.3h (E3/E4)
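RAE's exact formulation is in the repo; a minimal sketch of a role-conditioned REINFORCE baseline (the class name, EMA update, and decay constant are illustrative assumptions) could look like:

```python
from collections import defaultdict

class RoleConditionedBaseline:
    """Keeps a running mean return per (game, role); the REINFORCE
    advantage is the episode return minus that role's baseline, so
    asymmetric roles (e.g. first vs second player) are not penalized
    for having different expected returns."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.baseline = defaultdict(float)
        self.seen = set()

    def advantage(self, game, role, episode_return):
        key = (game, role)
        if key not in self.seen:
            # First episode for this (game, role) seeds the baseline.
            self.seen.add(key)
            self.baseline[key] = episode_return
        else:
            # Exponential moving average of returns for this role.
            self.baseline[key] = (
                self.decay * self.baseline[key]
                + (1 - self.decay) * episode_return
            )
        return episode_return - self.baseline[key]
```

The per-role split is the point: in Kuhn Poker the two seats have different expected payoffs, so a shared baseline would systematically bias one role's gradient.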
## Citation

```bibtex
@article{xiang2026adaptive-game-curriculum,
  title={Adaptive Game Curriculum for Multi-Turn Self-Play Reinforcement Learning in Language Models},
  author={Xiang, Jinyu},
  year={2026},
  url={https://github.com/XiangJinyu/textarena-rl}
}
```