Qwen3-8B-RL — EpisodeBench Story Generation

This is Qwen3-8B fine-tuned with diversity-aware controllable reinforcement learning on the HeAAAAA/story_generation_rl dataset, released as part of EpisodeBench, a full-cycle benchmarking pipeline for long-form interactive story generation with controllable RL.

Among all Qwen3-8B variants reported in the paper, Qwen3-8B-RL achieves the best overall narrative-quality Total score (83.30) on the EpisodeBench test set under the calibrated Qwen3-8B-normal evaluator, while substantially improving episode-transition control, schema compliance, and pacing over the base model.


Why this model?

Strong general-purpose LLMs can produce locally fluent narrative text, yet still collapse pacing or miss valid episode transitions under long-form interactive storytelling. Under the same prompting setup, GPT-5-chat achieves only ~4% on-time transitions in our analysis.

EpisodeBench operationalizes long-form interactive storytelling as progression over an episode graph G = (E, T, B), so that:

  • the graph-derived next-episode label provides direct supervision for transition correctness,
  • the structured output schema provides supervision for JSON / format validity,
  • the interaction budget (T = 10 turns) provides a pacing contract.

Qwen3-8B-RL is trained directly against these graph-derived signals, without additional human preference annotation.


Task formulation

Each story is a directed graph G = (E, T, B) with global background B (story name, narrative style, description, characters), episode nodes E, and trigger-conditioned transitions T. Within an episode E_i, interaction is a sequence of message pairs M_i = [(u_1, a_1), …, (u_T, a_T)], where the subscript T is the interaction budget of 10 turns (not the transition set T of the graph).

Each assistant response is structured: a_t = (P_t, e_t) — a generated plot_list plus a predicted next_episode. A transition E_i → e_t is valid only if the generated continuation satisfies the corresponding trigger τ_{i→e_t}(P_t) = True.

The model must therefore generate coherent narrative content and make globally valid transition decisions at the right pace.
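
As an illustration of the formulation above, the following is a minimal sketch of an episode graph with trigger-conditioned transitions. The class names, the keyword-based `mentions` trigger, and the rule that staying in the current episode needs no trigger are all assumptions made for this example; the released codebase may represent graphs and triggers differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: a trigger inspects the generated plot_list P_t.
Trigger = Callable[[List[dict]], bool]


@dataclass
class Episode:
    episode_id: str
    # Outgoing edges: successor episode id -> trigger predicate tau_{i->j}.
    triggers: Dict[str, Trigger] = field(default_factory=dict)


@dataclass
class EpisodeGraph:
    background: dict              # B: story name, style, description, characters
    episodes: Dict[str, Episode]  # E, with transitions stored per episode

    def is_valid_transition(self, current_id: str, plot_list: List[dict],
                            next_episode: str) -> bool:
        """Check a predicted next_episode against the graph."""
        if next_episode == current_id:
            # Assumption: remaining in the current episode needs no trigger.
            return True
        trigger = self.episodes[current_id].triggers.get(next_episode)
        # A missing edge is invalid; otherwise the trigger must fire on P_t.
        return trigger is not None and trigger(plot_list)


def mentions(keyword: str) -> Trigger:
    """Toy trigger: fires when any narrative mentions the keyword."""
    return lambda plots: any(keyword in p.get("narrative", "") for p in plots)


graph = EpisodeGraph(
    background={"story_name": "demo"},
    episodes={
        "ep1": Episode("ep1", {"ep2": mentions("key")}),
        "ep2": Episode("ep2"),
    },
)
plots = [{"narrative": "She finds the key under the mat.", "role_dialogue": []}]
print(graph.is_valid_transition("ep1", plots, "ep2"))
```

Real triggers are presumably semantic conditions rather than keyword matches; the toy predicate only keeps the example self-contained.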


Training procedure

Algorithm — Diversity-aware Controllable RL

For the same episode state and user input, the policy samples a rollout group A_t = {a_t^(1), …, a_t^(G)} under identical graph constraints. Each candidate is scored along four channels:

  • R_acc — transition accuracy: whether the predicted next_episode matches the graph-derived reference successor. Directly supervisable from the episode graph; no human annotation required.
  • R_fmt — format / schema compliance: 0.4 · I(parseable JSON) + 0.6 · I(schema valid).
  • R_len — length control: rewards responses inside the target effective-length interval [L_min, L_max], with Gaussian decay outside.
  • R_div — DPP-style diversity: rewards semantically distinct continuations within a rollout group via the Schur-complement contribution of a regularized similarity kernel L = K + ηI on sentence-level embeddings.
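
The two simplest channels in the list above, R_fmt and R_len, can be sketched as follows. The 0.4 / 0.6 weights come from the text; the plateau value of 1.0 inside the interval, the `sigma` width, and measuring the decay from the nearest interval bound are assumptions, since the card does not give the exact functional form.

```python
import math


def format_reward(parseable: bool, schema_valid: bool) -> float:
    """R_fmt = 0.4 * I(parseable JSON) + 0.6 * I(schema valid)."""
    return 0.4 * float(parseable) + 0.6 * float(schema_valid)


def length_reward(length: int, l_min: int, l_max: int, sigma: float) -> float:
    """R_len sketch: full reward inside [l_min, l_max], Gaussian decay outside.

    Assumptions: the plateau is 1.0 and the decay is measured from the
    nearest bound with a hypothetical width `sigma`.
    """
    if l_min <= length <= l_max:
        return 1.0
    gap = (l_min - length) if length < l_min else (length - l_max)
    return math.exp(-(gap ** 2) / (2.0 * sigma ** 2))


# Parseable but schema-invalid output earns partial credit.
print(format_reward(True, False))
# A response 50 units above a hypothetical [100, 200] interval, sigma = 50.
print(length_reward(250, 100, 200, 50.0))
```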

Each channel is group-normalized and aggregated:

R(i) = λ_div · ŝ_div(i) + λ_fmt · ŝ_fmt(i) + λ_acc · ŝ_acc(i) + λ_len · ŝ_len(i)
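
A minimal sketch of this aggregation is given below. It assumes that "group-normalized" means a per-channel z-score within the rollout group (a common GRPO-style choice) and uses placeholder weights of 1.0, because the card does not list the λ values.

```python
import numpy as np


def group_normalize(scores, eps=1e-6):
    """Z-score one channel within a rollout group (assumed normalization)."""
    scores = np.asarray(scores, dtype=np.float64)
    return (scores - scores.mean()) / (scores.std() + eps)


def aggregate_rewards(channels, weights):
    """Weighted sum of group-normalized channels, one value per rollout.

    channels: dict mapping channel name -> raw per-rollout scores
    weights:  dict mapping channel name -> lambda coefficient
    """
    size = len(next(iter(channels.values())))
    total = np.zeros(size, dtype=np.float64)
    for name, raw in channels.items():
        total += weights[name] * group_normalize(raw)
    return total


# Raw scores for a hypothetical group of four rollouts.
channels = {
    "div": [0.2, 0.5, 0.8, 0.5],
    "fmt": [1.0, 1.0, 0.4, 1.0],
    "acc": [1.0, 0.0, 1.0, 1.0],
    "len": [1.0, 0.9, 1.0, 0.3],
}
weights = {"div": 1.0, "fmt": 1.0, "acc": 1.0, "len": 1.0}  # placeholder lambdas
rewards = aggregate_rewards(channels, weights)
print(rewards)
```

With this normalization, a channel on which every rollout scores the same contributes zero to all candidates, so it does not affect the ranking within the group.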

This decomposition separates semantic exploration (R_div) from structural control (R_acc, R_fmt, R_len), keeping rollouts compatible with the structured episode graph while still encouraging diverse continuations.
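
One way to read the R_div channel is as the Schur complement of each candidate with respect to the rest of its rollout group under L = K + ηI. The sketch below assumes a cosine-similarity kernel on unit-normalized, non-zero embeddings and a hypothetical η = 1e-3; the paper's exact kernel and any further scaling (for example a logarithm) are not specified in this card.

```python
import numpy as np


def dpp_diversity_scores(embeddings, eta=1e-3):
    """Per-rollout Schur-complement contribution under L = K + eta * I.

    For item i the score is
        L_ii - L_{i,rest} @ inv(L_{rest,rest}) @ L_{rest,i},
    which equals det(L) / det(L_{rest,rest}).
    """
    X = np.asarray(embeddings, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    K = X @ X.T                                       # cosine similarity kernel
    L = K + eta * np.eye(len(X))                      # regularized kernel
    scores = np.empty(len(X))
    for i in range(len(X)):
        rest = [j for j in range(len(X)) if j != i]
        L_rest = L[np.ix_(rest, rest)]
        l_i = L[rest, i]
        scores[i] = L[i, i] - l_i @ np.linalg.solve(L_rest, l_i)
    return scores


# Two identical continuations and one orthogonal one (toy 2-D embeddings).
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(dpp_diversity_scores(toy))
```

In the toy group, the two duplicated continuations receive scores close to zero, while the distinct one scores close to 1 + η, which is the behaviour a within-group diversity reward is meant to have.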

Hyperparameters

| Setting | Value |
| --- | --- |
| Base model | Qwen/Qwen3-8B |
| RL framework | EasyR1 (GRPO-style group rollouts) |
| Rollouts per context | 2 – 4 |
| Sampling | temperature 0.7, top-p 0.95 |
| KL penalty (vs. reference policy) | coefficient 0.02 |
| Learning rate | 5 × 10⁻⁶ |
| Precision | bf16 with gradient checkpointing |
| Max sequence length | 7,000 |
| Hardware | 2 × NVIDIA H200 |
| Wall-clock | ≈ 16 hours |

Training data

Trained on HeAAAAA/story_generation_rl: 22,233 turn-level interactive instances derived from 174 stories and 4,415 episodes, split between English (74.1%) and Chinese (25.9%) and covering normal, abnormal, and hacking-style user behaviors.


Evaluation results

All results below are reported on the EpisodeBench test set with the calibrated Qwen3-8B-normal evaluator (see paper for the calibration protocol against 300 human-rated samples). Quality scores are rated on a 0–5 scale and normalized to 0–100; each quality cell reports rollout@1 / avg@7. Acc@1, Pass@7, and JSON@7 are graph- and schema-derived metrics that are independent of automatic narrative-quality judgments.

Story generation quality (Qwen3-8B-normal evaluator)

| Model | Plot | Guide | Narr | Char | Trans | Total | Acc@1 | Pass@7 | JSON@7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B (base) | 78.83 / 79.13 | 73.50 / 73.78 | 87.97 / 88.31 | 86.15 / 86.16 | 80.67 / 80.97 | 82.54 / 82.74 | 70.75 | 90.53 | 93.49 |
| Qwen3-8B-SFT | 76.18 / 76.20 | 70.32 / 70.53 | 82.29 / 81.90 | 83.10 / 83.24 | 80.40 / 80.63 | 79.54 / 79.57 | 68.13 | 92.72 | 85.48 |
| Qwen3-8B-RL (this model) | 79.46 / 79.40 | 73.80 / 73.77 | 88.02 / 88.10 | 85.69 / 85.88 | 84.00 / 83.34 | 83.30 / 83.18 | 74.88 | 89.97 | 97.17 |
| Qwen3-8B-SFT-RL | 75.77 / 75.58 | 68.50 / 69.10 | 78.91 / 78.98 | 81.52 / 81.59 | 87.52 / 87.35 | 79.60 / 79.66 | 80.01 | 86.39 | 98.55 |

Relative to the base model at rollout@1, Qwen3-8B-RL improves Plot, Guide, Narr, and Trans, while Char is slightly lower (85.69 vs. 86.15). It also raises Acc@1 and JSON@7 and achieves the highest Total among Qwen3-8B variants, although Pass@7 dips slightly (90.53 to 89.97). Compared with Qwen3-8B-SFT-RL, it preserves stronger surface-level narrative-quality scores while still substantially raising structured controllability metrics.
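
For readers reproducing the columns, the per-context metrics might be computed roughly as follows. This sketch assumes that Acc@1 scores only the first rollout, that Pass@K counts a context as passed when any of the K rollouts predicts the reference successor, and that JSON@K is the fraction of schema-valid rollouts; the released evaluation pipeline is the authoritative definition.

```python
from statistics import mean


def summarize_context(quality_0_5, correct, json_valid):
    """Per-context metrics from K rollouts (K = 7 in the tables).

    quality_0_5: evaluator scores on the 0-5 scale, one per rollout
    correct:     whether each rollout's next_episode matches the reference
    json_valid:  whether each rollout satisfies the output schema
    """
    return {
        "rollout@1": 20.0 * quality_0_5[0],        # 0-5 rescaled to 0-100
        "avg@K": 20.0 * mean(quality_0_5),
        "Acc@1": float(correct[0]),
        "Pass@K": float(any(correct)),             # assumed "any of K" rule
        "JSON@K": mean([float(v) for v in json_valid]),
    }


# Hypothetical scores for one context with seven rollouts.
summary = summarize_context(
    quality_0_5=[4.0, 3.5, 4.5, 4.0, 4.0, 4.0, 4.0],
    correct=[False, True, True, False, True, True, True],
    json_valid=[True, True, True, True, True, True, False],
)
print(summary)
```

Benchmark-level numbers would then be averages of these per-context values over the test set, reported as percentages.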

Statistical significance (vs. Qwen3-8B base)

Paired sign-flip permutation tests on continuous reward / JSON metrics; McNemar's test on accuracy / pass-rate metrics; Benjamini–Hochberg FDR-adjusted q-values (* q < 0.05, ** q < 0.01, *** q < 0.001).

| Evaluator | Avg@K Total | R1 Total | Acc@1 | Pass@K | JSON Avg@K |
| --- | --- | --- | --- | --- | --- |
| Qwen3-8B-normal | +0.44*** | +0.76** | +4.12*** | −0.56 | +3.68*** |
| Qwen3-8B-exppos | +0.47*** | +0.82*** | +4.12*** | −0.56 | +3.68*** |

Gains on Total quality, Acc@1, and JSON are statistically significant under both calibrated evaluators; the small Pass@K decrease (−0.56) is not significant at q < 0.05.
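
The three procedures named in this section can be sketched as below. This is an illustrative implementation, not the paper's analysis script: the Monte Carlo permutation count, the add-one correction, and the use of the exact (binomial) form of McNemar's test are assumptions.

```python
import math
import numpy as np


def sign_flip_pvalue(diffs, n_perm=10000, seed=0):
    """Two-sided paired sign-flip permutation test on per-item differences."""
    d = np.asarray(diffs, dtype=np.float64)
    rng = np.random.default_rng(seed)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    permuted = np.abs((signs * d).mean(axis=1))
    # Add-one correction keeps the Monte Carlo estimate above zero.
    return (1 + np.sum(permuted >= observed - 1e-12)) / (n_perm + 1)


def mcnemar_exact_pvalue(b, c):
    """Exact two-sided McNemar test from discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) * 0.5 ** n
    return min(1.0, 2.0 * tail)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR-adjusted q-values, in the input order."""
    p = np.asarray(pvalues, dtype=np.float64)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Here `diffs` would hold per-context metric differences between the RL model and the base model, and `b` / `c` the counts of contexts where exactly one of the two models is correct.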

Episode-transition timing

EpisodeBench measures whether the model advances within the intended pacing window. With the interaction budget T = 10, transitions are bucketed into too fast (< 80% of T), on time (80–100%), too slow (100–150%), and failure (> 150%):

| Model | Too fast | On time | Too slow | Failure |
| --- | --- | --- | --- | --- |
| Qwen3-8B (base) | 86.13% | 4.96% | | |
| Qwen3-8B-RL | 59.41% | 11.87% | | |
| Qwen3-8B-SFT-RL | 55.04% | 13.00% | | |

Qwen3-8B-RL reduces premature transitions by ~27 percentage points and more than doubles the on-time rate.
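
The bucketing rule can be written down directly from the thresholds stated above. In this sketch, `transition_turn` is a hypothetical name for the turn at which the model leaves the episode, and assigning the exact boundaries (100% to "on time", 150% to "too slow") is an assumption, since the card does not say which side they fall on.

```python
def timing_bucket(transition_turn: int, budget: int = 10) -> str:
    """Classify a transition by when it happens relative to the budget T."""
    # Integer comparisons avoid floating-point issues at the boundaries.
    if 10 * transition_turn < 8 * budget:      # below 80% of T
        return "too fast"
    if transition_turn <= budget:              # 80% to 100% of T
        return "on time"
    if 10 * transition_turn <= 15 * budget:    # above 100%, up to 150% of T
        return "too slow"
    return "failure"                           # beyond 150% of T


for turn in (3, 8, 10, 12, 16):
    print(turn, timing_bucket(turn))
```

With T = 10, a transition at turn 7 or earlier counts as too fast, and turns 8 to 10 form the on-time window.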

Note: Even with these gains, on-time rates remain modest in absolute terms — pacing control under a fixed ten-turn contract is still an open challenge. See "Limitations".


Intended use

Use Qwen3-8B-RL as a Content Completer that:

  1. Generates narrative content within an episode under a structured outline.
  2. Predicts the correct next episode according to the episode graph.
  3. Complies with a structured JSON output schema.
  4. Paces the episode within the interaction budget T = 10 (advancing in the late-episode window 80%–100% of the budget).

It is suitable for: research on long-form interactive storytelling, controllable role-playing, schema-guided generation, evaluator/RL studies, and as a baseline for further controllable-RL methods.


Inference

The model expects the EpisodeBench system prompt encoding the fixed structured outline (background B, episode goal G_goal, scene S_scene, valid triggers τ_i, interaction budget T = 10). The expected assistant output is a JSON object with plot_list and next_episode. See the codebase for the canonical prompt and parser.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HeAAAAA/story_generation_Qwen3_8B_RL"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

# Placeholders: substitute the canonical EpisodeBench prompt from the codebase.
system_prompt = "<EpisodeBench structured outline: B, G_goal, S_scene, τ_i, T=10>"
user_input    = "<u_t: free-form user utterance>"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": user_input},
]
# Render the chat template to text, then tokenize and move tensors to the model's device.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# do_sample=True ensures temperature / top_p (the training-time sampling settings) take effect.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The expected response schema is:

{
  "plot_list": [
    {
      "narrative": "<scene/action narration>",
      "role_dialogue": [
        {"name": "<character>", "utterance": "<line>"}
      ]
    }
  ],
  "next_episode": "<predicted successor episode id>"
}
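
A lightweight check of a raw model output against this schema might look like the sketch below. It is not the canonical parser from the codebase: the function name is hypothetical, and treating an empty plot_list as valid and requiring every field shown above to be present are assumptions.

```python
import json


def check_response(text: str):
    """Return (parseable, schema_valid) for one raw model output."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False, False
    if not isinstance(obj, dict):
        return True, False
    plots = obj.get("plot_list")
    if not isinstance(plots, list) or not isinstance(obj.get("next_episode"), str):
        return True, False
    for plot in plots:
        # Each plot needs a narrative string and a list of dialogue turns.
        if not isinstance(plot, dict) or not isinstance(plot.get("narrative"), str):
            return True, False
        dialogue = plot.get("role_dialogue")
        if not isinstance(dialogue, list):
            return True, False
        for turn in dialogue:
            # Each dialogue turn needs string "name" and "utterance" fields.
            if not (isinstance(turn, dict)
                    and isinstance(turn.get("name"), str)
                    and isinstance(turn.get("utterance"), str)):
                return True, False
    return True, True


example = json.dumps({
    "plot_list": [{
        "narrative": "The door creaks open.",
        "role_dialogue": [{"name": "Mira", "utterance": "Who is there?"}],
    }],
    "next_episode": "ep2",
})
print(check_response(example))
```

The two returned flags correspond to the parseable-JSON and schema-valid indicators used by the format reward during training.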

For full rollout-style evaluation (Acc@1, Pass@7, JSON@7, transition-timing analysis) use the released evaluation pipeline in the GitHub repository.


Limitations

  • The model is trained on synthetic, source-guided trajectories under a fixed pacing contract (T = 10). It is optimized for structured progression control, not unconstrained literary creativity.
  • Improvements driven by EpisodeBench-derived RL are most pronounced on transition, pacing, and schema-following metrics; gains on subjective narrative-quality dimensions (Plot, Guidance, Narration, Character) are smaller and more evaluator-dependent.
  • On-time transition rate, while substantially improved over the base model, remains modest in absolute terms. Long-horizon pacing control is an open problem.
  • We do not systematically study adversarial or safety-relevant prompts. Lightweight, data-level heuristic corrections were applied to ill-formed user inputs during training, but dedicated safety mechanisms are out of scope.
  • Evaluator-based narrative-quality scores should be read as ranking-level, not absolute-score, measurements (cell-level Spearman ρ ≈ 0.78 vs. human ratings under Qwen3-8B-normal).

License

Released under CC BY 4.0 for research use with attribution. The base model is subject to its own license; please consult Qwen/Qwen3-8B for the upstream terms.

This model is intended for research on structured narrative progression, episode-transition control, pacing analysis, evaluator calibration, and controllable long-form generation. It is not intended for reconstructing original copyrighted stories, evaluating general literary merit, or deploying unrestricted interactive storytelling systems. Commercial use and redistribution of generated content for interactive deployment require additional review.


Citation

If you use this model, please cite the EpisodeBench paper:

@inproceedings{episodebench2026,
  title     = {EpisodeBench: A Full-Cycle Benchmarking Pipeline for Long-form Interactive Story Generation with Controllable RL},
  author    = {Anonymous},
  booktitle = {XX},
  year      = {2026},
  url       = {https://github.com/KaiHe-better/Longform_Interactive_Story_Generation}
}

Related releases

| Resource | Type | Link |
| --- | --- | --- |
| Story Generation SFT | Dataset (episode-packed) | HeAAAAA/story_generation_sft |
| Story Generation RL | Dataset (turn-level) | HeAAAAA/story_generation_rl |
| Qwen3-8B-RL (this model) | Generator (RL only) | HeAAAAA/story_generation_Qwen3_8B_RL |
| Reward Train: {Expneg, Exppos, Normal, Uniform} | Judge training | |
| Reward Test | Judge testing | |
| Human Ratings | Evaluator calibration (300 items) | |