Qwen3-8B-RL — EpisodeBench Story Generation
This is Qwen3-8B fine-tuned with diversity-aware controllable reinforcement learning on the HeAAAAA/story_generation_rl dataset, released as part of EpisodeBench, a full-cycle benchmarking pipeline for long-form interactive story generation with controllable RL.
Among all Qwen3-8B variants reported in the paper, Qwen3-8B-RL achieves the best overall narrative-quality Total score (83.30) on the EpisodeBench test set under the calibrated Qwen3-8B-normal evaluator, while substantially improving episode-transition control, schema compliance, and pacing over the base model.
- 📄 Paper: EpisodeBench: A Full-Cycle Benchmarking Pipeline for Long-form Interactive Story Generation with Controllable RL
- 💻 Code: https://github.com/KaiHe-better/Longform_Interactive_Story_Generation
- 🗂️ Training data: HeAAAAA/story_generation_rl
- 🧱 Base model: Qwen/Qwen3-8B
Why this model?
Strong general-purpose LLMs can produce locally fluent narrative text, yet still collapse pacing or miss valid episode transitions under long-form interactive storytelling. Under the same prompting setup, GPT-5-chat achieves only ~4% on-time transitions in our analysis.
EpisodeBench operationalizes long-form interactive storytelling as progression over an episode graph G = (E, T, B), so that:
- the graph-derived next-episode label provides direct supervision for transition correctness,
- the structured output schema provides supervision for JSON / format validity,
- the interaction budget (T = 10 turns) provides a pacing contract.
Qwen3-8B-RL is trained directly against these graph-derived signals, without additional human preference annotation.
Task formulation
Each story is a directed graph G = (E, T, B) with global background B (story name, narrative style, description, characters), episode nodes E, and trigger-conditioned transitions T. Within an episode E_i, interaction is a sequence of message pairs M_i = [(u_1, a_1), …, (u_T, a_T)], with interaction budget T = 10.
Each assistant response is structured: a_t = (P_t, e_t) — a generated plot_list plus a predicted next_episode. A transition E_i → e_t is valid only if the generated continuation satisfies the corresponding trigger τ_{i→e_t}(P_t) = True.
The model must therefore generate coherent narrative content and make globally valid transition decisions at the right pace.
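A minimal sketch of the formulation above, assuming triggers can be modeled as Python predicates over the generated plot_list. The class name `EpisodeGraph`, the helper `mentions`, and the keyword-based trigger are hypothetical illustrations, not part of the released codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A trigger inspects the generated plot_list and reports whether the transition fires.
Trigger = Callable[[List[dict]], bool]


@dataclass
class EpisodeGraph:
    background: dict        # B: story name, narrative style, description, characters
    episodes: List[str]     # E: episode ids
    # T: transitions[src][dst] holds the trigger tau_{src -> dst}
    transitions: Dict[str, Dict[str, Trigger]] = field(default_factory=dict)

    def add_transition(self, src: str, dst: str, trigger: Trigger) -> None:
        self.transitions.setdefault(src, {})[dst] = trigger

    def is_valid_transition(self, src: str, dst: str, plot_list: List[dict]) -> bool:
        """True only if an edge src -> dst exists and its trigger holds on plot_list."""
        trigger = self.transitions.get(src, {}).get(dst)
        return trigger is not None and trigger(plot_list)


def mentions(keyword: str) -> Trigger:
    """Hypothetical keyword trigger; the benchmark defines its own trigger conditions."""
    def check(plot_list: List[dict]) -> bool:
        return any(keyword in beat.get("narrative", "") for beat in plot_list)
    return check


graph = EpisodeGraph(background={"story_name": "demo"}, episodes=["E1", "E2", "E3"])
graph.add_transition("E1", "E2", mentions("key"))

plot = [{"narrative": "She finds the key under the mat.", "role_dialogue": []}]
print(graph.is_valid_transition("E1", "E2", plot))  # edge exists and trigger holds
print(graph.is_valid_transition("E1", "E3", plot))  # no such edge in the graph
```

Keeping the trigger on the edge rather than on the node mirrors the notation τ_{i→e_t}: the same continuation can satisfy one outgoing transition and fail another.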
Training procedure
Algorithm — Diversity-aware Controllable RL
For the same episode state and user input, the policy samples a rollout group A_t = {a_t^(1), …, a_t^(G)} under identical graph constraints. Each candidate is scored along four channels:
- R_acc (transition accuracy): whether the predicted next_episode matches the graph-derived reference successor. Directly supervisable from the episode graph; no human annotation required.
- R_fmt (format / schema compliance): 0.4 · I(parseable JSON) + 0.6 · I(schema valid).
- R_len (length control): rewards responses inside the target effective-length interval [L_min, L_max], with Gaussian decay outside.
- R_div (DPP-style diversity): rewards semantically distinct continuations within a rollout group via the Schur-complement contribution of a regularized similarity kernel L = K + ηI on sentence-level embeddings.
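The diversity channel is the least obvious of the four, so here is a rough pure-Python sketch of one way to compute a Schur-complement score per rollout. Several choices are assumptions rather than the paper's settings: K is taken to be cosine similarity, η defaults to 0.1, each rollout is scored against all other rollouts in its group, and the 2-D embeddings are toy values standing in for sentence-level embeddings.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def cosine_kernel(embeddings: List[Vector]) -> Matrix:
    """K[i][j] = cosine similarity of embeddings i and j (assumes non-zero vectors)."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    return [
        [sum(a * b for a, b in zip(ei, ej)) / (ni * nj)
         for ej, nj in zip(embeddings, norms)]
        for ei, ni in zip(embeddings, norms)
    ]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def diversity_scores(embeddings: List[Vector], eta: float = 0.1) -> List[float]:
    """Schur complement L_ii - L_iS L_SS^{-1} L_Si for each item i, S = rest of group."""
    k = cosine_kernel(embeddings)
    n = len(k)
    lmat = [[k[i][j] + (eta if i == j else 0.0) for j in range(n)] for i in range(n)]
    scores = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        if not rest:                       # a group of one has nothing to condition on
            scores.append(lmat[i][i])
            continue
        l_ss = [[lmat[a][b] for b in rest] for a in rest]
        l_si = [lmat[a][i] for a in rest]
        x = solve(l_ss, l_si)              # x = L_SS^{-1} L_Si
        scores.append(lmat[i][i] - sum(u * v for u, v in zip(l_si, x)))
    return scores


# Toy group: the first two rollouts are near-duplicates, the third is distinct.
scores = diversity_scores([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
print(scores)
```

In this toy group the third rollout receives the largest score, because the first two explain most of each other's variance under the kernel. The ηI term keeps L positive definite, so the linear solve stays well-posed even when two rollouts are nearly identical.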
Each channel is group-normalized and aggregated:
R(i) = λ_div · ŝ_div(i) + λ_fmt · ŝ_fmt(i) + λ_acc · ŝ_acc(i) + λ_len · ŝ_len(i)
This decomposition separates semantic exploration (R_div) from structural control (R_acc, R_fmt, R_len), keeping rollouts compatible with the structured episode graph while still encouraging diverse continuations.
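The length channel and the aggregation step can be sketched as follows. This is an illustration under stated assumptions: the normalization is taken to be a per-group z-score, the Gaussian decay is measured from the nearest interval bound with a hypothetical width `sigma`, and the weights and raw scores are placeholders rather than the paper's values. The unit of "effective length" (tokens, characters, or something else) is left to the caller.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List


def length_reward(length: int, l_min: int, l_max: int, sigma: float) -> float:
    """1.0 inside [l_min, l_max]; Gaussian decay in the distance to the nearest bound."""
    if l_min <= length <= l_max:
        return 1.0
    gap = l_min - length if length < l_min else length - l_max
    return math.exp(-(gap ** 2) / (2 * sigma ** 2))


def group_normalize(scores: List[float], eps: float = 1e-6) -> List[float]:
    """Z-score one channel within a rollout group (assumed form of the normalization)."""
    mu, sd = mean(scores), pstdev(scores)
    return [(s - mu) / (sd + eps) for s in scores]


def aggregate(channels: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """R(i) = sum over channels c of lambda_c * normalized score of rollout i."""
    normalized = {name: group_normalize(vals) for name, vals in channels.items()}
    group_size = len(next(iter(channels.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in channels)
        for i in range(group_size)
    ]


# Hypothetical group of three rollouts; all numbers below are placeholders.
channels = {
    "acc": [1.0, 0.0, 1.0],
    "fmt": [1.0, 1.0, 0.4],
    "len": [length_reward(n, 200, 400, sigma=50.0) for n in (250, 450, 390)],
    "div": [0.2, 0.9, 0.3],
}
weights = {"acc": 1.0, "fmt": 0.5, "len": 0.25, "div": 0.25}
rewards = aggregate(channels, weights)
print(rewards)
```

Because every channel is centered within the group before weighting, the aggregated rewards sum to roughly zero, and a channel on which all rollouts tie contributes nothing to the ranking.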
Hyperparameters
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| RL framework | EasyR1 (GRPO-style group rollouts) |
| Rollouts per context | 2 – 4 |
| Sampling | temperature 0.7, top-p 0.95 |
| KL penalty (vs. reference policy) | coefficient 0.02 |
| Learning rate | 5 × 10⁻⁶ |
| Precision | bf16 with gradient checkpointing |
| Max sequence length | 7,000 |
| Hardware | 2 × NVIDIA H200 |
| Wall-clock | ≈ 16 hours |
Training data
Trained on HeAAAAA/story_generation_rl: 22,233 turn-level interactive instances derived from 174 stories and 4,415 episodes, covering English (74.1%) and Chinese (25.9%), with normal, abnormal, and hacking-style user behaviors.
Evaluation results
All results below are reported on the EpisodeBench test set with the calibrated Qwen3-8B-normal evaluator (see paper for the calibration protocol against 300 human-rated samples). Quality scores are 0–5 normalized to 0–100 (rollout@1 / avg@7); Acc@1, Pass@7, and JSON@7 are graph- and schema-derived metrics that are independent of automatic narrative-quality judgments.
Story generation quality (Qwen3-8B-normal evaluator)
| Model | Plot | Guide | Narr | Char | Trans | Total | Acc@1 | Pass@7 | JSON@7 |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B (base) | 78.83 / 79.13 | 73.50 / 73.78 | 87.97 / 88.31 | 86.15 / 86.16 | 80.67 / 80.97 | 82.54 / 82.74 | 70.75 | 90.53 | 93.49 |
| Qwen3-8B-SFT | 76.18 / 76.20 | 70.32 / 70.53 | 82.29 / 81.90 | 83.10 / 83.24 | 80.40 / 80.63 | 79.54 / 79.57 | 68.13 | 92.72 | 85.48 |
| Qwen3-8B-RL (this model) | 79.46 / 79.40 | 73.80 / 73.77 | 88.02 / 88.10 | 85.69 / 85.88 | 84.00 / 83.34 | 83.30 / 83.18 | 74.88 | 89.97 | 97.17 |
| Qwen3-8B-SFT-RL | 75.77 / 75.58 | 68.50 / 69.10 | 78.91 / 78.98 | 81.52 / 81.59 | 87.52 / 87.35 | 79.60 / 79.66 | 80.01 | 86.39 | 98.55 |
At rollout@1, Qwen3-8B-RL improves over the base model on four of the five scored dimensions (Plot, Guide, Narr, Trans), with Char slightly lower (85.69 vs. 86.15), and raises Acc@1 and JSON@7 at the same time, achieving the highest Total among Qwen3-8B variants. Pass@7 dips slightly (89.97 vs. 90.53). Compared with Qwen3-8B-SFT-RL, it preserves stronger surface-level narrative-quality scores while still substantially raising structured controllability metrics.
Statistical significance (vs. Qwen3-8B base)
Paired sign-flip permutation tests on continuous reward / JSON metrics; McNemar's test on accuracy / pass-rate metrics; Benjamini–Hochberg FDR-adjusted q-values (* q < 0.05, ** q < 0.01, *** q < 0.001).
| Evaluator | Avg@K Total | R1 Total | Acc@1 | Pass@K | JSON Avg@K |
|---|---|---|---|---|---|
| Qwen3-8B-normal | +0.44*** | +0.76** | +4.12*** | −0.56 | +3.68*** |
| Qwen3-8B-exppos | +0.47*** | +0.82*** | +4.12*** | −0.56 | +3.68*** |
Gains are statistically significant on Total quality, Acc@1, and JSON across both calibrated evaluators.
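For readers unfamiliar with the procedure, the sketch below shows a generic paired sign-flip permutation test and a Benjamini–Hochberg adjustment in plain Python. It is not the released evaluation code: the function names, the Monte Carlo sample count, and the add-one correction are assumptions, and McNemar's test is not shown.

```python
import random
from typing import List, Sequence


def sign_flip_pvalue(diffs: Sequence[float], n_perm: int = 10000, seed: int = 0) -> float:
    """Two-sided paired sign-flip permutation test on per-item score differences."""
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps the estimate above 0


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted q-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical per-item differences (model minus base) and raw p-values.
p_consistent = sign_flip_pvalue([0.5] * 12)
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_consistent, q)
```

The sign-flip test suits paired, continuous scores because it needs no distributional assumption beyond symmetry of the differences under the null; the BH step then controls the false discovery rate across the several metrics tested together.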
Episode-transition timing
EpisodeBench measures whether the model advances within the intended pacing window. With the interaction budget T = 10, transitions are bucketed into too fast (< 80% of T), on time (80–100%), too slow (100–150%), and failure (> 150%):
| Model | Too fast | On time | Too slow | Failure |
|---|---|---|---|---|
| Qwen3-8B (base) | 86.13% | 4.96% | — | — |
| Qwen3-8B-RL | 59.41% | 11.87% | — | — |
| Qwen3-8B-SFT-RL | 55.04% | 13.00% | — | — |
Qwen3-8B-RL reduces premature transitions by ~27 percentage points and more than doubles the on-time rate.
Note: Even with these gains, on-time rates remain modest in absolute terms — pacing control under a fixed ten-turn contract is still an open challenge. See "Limitations".
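The bucketing rule can be written down directly. The sketch below assumes the quantity being bucketed is the turn on which the transition fires, and that the shared boundaries resolve as stated in the comments (exactly 100% counts as on time, exactly 150% as too slow); the function name is hypothetical.

```python
def timing_bucket(transition_turn: int, budget: int = 10) -> str:
    """Bucket a transition by the turn on which it fired, relative to the budget T."""
    pct = 100 * transition_turn      # integer comparison avoids float rounding at 80%
    if pct < 80 * budget:
        return "too fast"            # before 80% of the budget
    if pct <= 100 * budget:
        return "on time"             # 80% to 100% inclusive (assumed boundary handling)
    if pct <= 150 * budget:
        return "too slow"            # above 100%, up to 150% inclusive (assumed)
    return "failure"                 # beyond 150% of the budget


for turn in (3, 8, 10, 12, 16):
    print(turn, timing_bucket(turn))
```

With T = 10 this makes turns 8 through 10 the on-time window, which is why a model that transitions eagerly lands almost entirely in the too-fast bucket.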
Intended use
Use Qwen3-8B-RL as a Content Completer that:
- Generates narrative content within an episode under a structured outline.
- Predicts the correct next episode according to the episode graph.
- Complies with a structured JSON output schema.
- Paces the episode within the interaction budget T = 10 (advancing in the late-episode window 80%–100% of the budget).
It is suitable for: research on long-form interactive storytelling, controllable role-playing, schema-guided generation, evaluator/RL studies, and as a baseline for further controllable-RL methods.
Inference
The model expects the EpisodeBench system prompt encoding the fixed structured outline (background B, episode goal G_goal, scene S_scene, valid triggers τ_i, interaction budget T = 10). The expected assistant output is a JSON object with plot_list and next_episode. See the codebase for the canonical prompt and parser.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HeAAAAA/story_generation_Qwen3_8B_RL"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

# Placeholders: substitute the canonical EpisodeBench prompt from the codebase.
system_prompt = "<EpisodeBench structured outline: B, G_goal, S_scene, τ_i, T=10>"
user_input = "<u_t: free-form user utterance>"
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# do_sample=True makes the temperature / top_p settings take effect explicitly.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)

# Decode only the newly generated tokens.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
The expected response schema is:
```json
{
  "plot_list": [
    {
      "narrative": "<scene/action narration>",
      "role_dialogue": [
        {"name": "<character>", "utterance": "<line>"}
      ]
    }
  ],
  "next_episode": "<predicted successor episode id>"
}
```
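A lightweight validator for this schema might look like the sketch below. It is not the canonical parser from the repository: the function names are hypothetical, and treating an empty plot_list as invalid and role_dialogue as optional are assumptions. The 0.4 / 0.6 split reuses the R_fmt weighting from the training procedure.

```python
import json
from typing import Any, List, Tuple


def schema_problems(obj: Any) -> List[str]:
    """Return a list of schema violations; an empty list means the object is valid."""
    if not isinstance(obj, dict):
        return ["top level is not an object"]
    problems = []
    if not isinstance(obj.get("next_episode"), str):
        problems.append("next_episode missing or not a string")
    plot_list = obj.get("plot_list")
    if not isinstance(plot_list, list) or not plot_list:
        problems.append("plot_list missing, not a list, or empty")
        return problems
    for i, beat in enumerate(plot_list):
        if not isinstance(beat, dict) or not isinstance(beat.get("narrative"), str):
            problems.append(f"plot_list[{i}].narrative missing or not a string")
            continue
        dialogue = beat.get("role_dialogue", [])  # assumed optional
        if not isinstance(dialogue, list):
            problems.append(f"plot_list[{i}].role_dialogue is not a list")
            continue
        for j, line in enumerate(dialogue):
            ok = (isinstance(line, dict)
                  and isinstance(line.get("name"), str)
                  and isinstance(line.get("utterance"), str))
            if not ok:
                problems.append(f"plot_list[{i}].role_dialogue[{j}] malformed")
    return problems


def format_score(text: str) -> Tuple[float, List[str]]:
    """0.4 for parseable JSON plus 0.6 more when the parsed object is schema-valid."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as err:
        return 0.0, [f"not parseable: {err.msg}"]
    problems = schema_problems(obj)
    return (1.0 if not problems else 0.4), problems


valid = ('{"plot_list": [{"narrative": "A door opens.", "role_dialogue": '
         '[{"name": "Ann", "utterance": "Hello?"}]}], "next_episode": "E2"}')
print(format_score(valid))
print(format_score('{"plot_list": []}'))
```

Returning the list of problems alongside the score makes it easier to see why an output lost the schema-valid portion, which a bare number would hide.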
For full rollout-style evaluation (Acc@1, Pass@7, JSON@7, transition-timing analysis) use the released evaluation pipeline in the GitHub repository.
Limitations
- The model is trained on synthetic, source-guided trajectories under a fixed pacing contract (T = 10). It is optimized for structured progression control, not unconstrained literary creativity.
- Improvements driven by EpisodeBench-derived RL are most pronounced on transition, pacing, and schema-following metrics; gains on subjective narrative-quality dimensions (Plot, Guidance, Narration, Character) are smaller and more evaluator-dependent.
- On-time transition rate, while substantially improved over the base model, remains modest in absolute terms. Long-horizon pacing control is an open problem.
- We do not systematically study adversarial or safety-relevant prompts. Lightweight, data-level heuristic corrections were applied to ill-formed user inputs during training, but dedicated safety mechanisms are out of scope.
- Evaluator-based narrative-quality scores should be read as ranking-level, not absolute-score, measurements (cell-level Spearman ρ ≈ 0.78 vs. human ratings under Qwen3-8B-normal).
License
Released under CC BY 4.0 for research use with attribution. The base model is subject to its own license; please consult Qwen/Qwen3-8B for the upstream terms.
This model is intended for research on structured narrative progression, episode-transition control, pacing analysis, evaluator calibration, and controllable long-form generation. It is not intended for reconstructing original copyrighted stories, evaluating general literary merit, or deploying unrestricted interactive storytelling systems. Commercial use and redistribution of generated content for interactive deployment require additional review.
Citation
If you use this model, please cite the EpisodeBench paper:
```bibtex
@inproceedings{episodebench2026,
  title     = {EpisodeBench: A Full-Cycle Benchmarking Pipeline for Long-form Interactive Story Generation with Controllable RL},
  author    = {Anonymous},
  booktitle = {XX},
  year      = {2026},
  url       = {https://github.com/KaiHe-better/Longform_Interactive_Story_Generation}
}
```
Related releases
| Resource | Type | Link |
|---|---|---|
| Story Generation SFT | Dataset (episode-packed) | HeAAAAA/story_generation_sft |
| Story Generation RL | Dataset (turn-level) | HeAAAAA/story_generation_rl |
| Qwen3-8B-RL (this model) | Generator (RL only) | HeAAAAA/story_generation_Qwen3_8B_RL |
| Reward Train: {Expneg, Exppos, Normal, Uniform} | Judge training | – |
| Reward Test | Judge testing | – |
| Human Ratings | Evaluator calibration (300 items) | – |