Qwen3-8B-RL — EpisodeBench Story Generation

This is Qwen3-8B fine-tuned with diversity-aware controllable reinforcement learning on the HeAAAAA/story_generation_rl dataset, released as part of EpisodeBench, a full-cycle benchmarking pipeline for long-form interactive story generation with controllable RL.

Among all Qwen3-8B variants reported in the paper, Qwen3-8B-RL achieves the best overall narrative-quality Total score (83.30) on the EpisodeBench test set under the calibrated Qwen3-8B-normal evaluator, while substantially improving episode-transition control, schema compliance, and pacing over the base model.

📄 Paper: EpisodeBench: A Full-Cycle Benchmarking Pipeline for Long-form Interactive Story Generation with Controllable RL
💻 Code: https://github.com/KaiHe-better/Longform_Interactive_Story_Generation
🗂️ Training data: HeAAAAA/story_generation_rl
🧱 Base model: Qwen/Qwen3-8B

Why this model?

Strong general-purpose LLMs can produce locally fluent narrative text, yet still collapse pacing or miss valid episode transitions under long-form interactive storytelling. Under the same prompting setup, GPT-5-chat achieves only ~4% on-time transitions in our analysis.

EpisodeBench operationalizes long-form interactive storytelling as progression over an episode graph G = (E, T, B), so that:

the graph-derived next-episode label provides direct supervision for transition correctness,
the structured output schema provides supervision for JSON / format validity,
the interaction budget (T = 10 turns) provides a pacing contract.

Qwen3-8B-RL is trained directly against these graph-derived signals, without additional human preference annotation.

Task formulation

Each story is a directed graph G = (E, T, B) with global background B (story name, narrative style, description, characters), episode nodes E, and trigger-conditioned transitions T. Within an episode E_i, interaction is a sequence of message pairs M_i = [(u_1, a_1), …, (u_T, a_T)], where T here denotes the interaction budget of 10 turns (the symbol is reused: inside G it denotes the transition set).

Each assistant response is structured: a_t = (P_t, e_t) — a generated plot_list plus a predicted next_episode. A transition E_i → e_t is valid only if the generated continuation satisfies the corresponding trigger τ_{i→e_t}(P_t) = True.

The model must therefore generate coherent narrative content and make globally valid transition decisions at the right pace.
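
The formulation above can be sketched as a small data structure. This is only an illustration under stated assumptions: triggers are modeled as Python predicates over the generated plot_list, a plot_list is simplified to a list of strings, and staying in the current episode is treated as always allowed. The names `EpisodeGraph` and `is_valid_transition` are hypothetical and are not taken from the EpisodeBench codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Simplification (assumption): a plot_list is a list of narrative strings.
PlotList = List[str]
Trigger = Callable[[PlotList], bool]


@dataclass
class EpisodeGraph:
    """Hypothetical container for G = (E, T, B)."""
    background: Dict[str, str]   # B: story name, style, description, ...
    episodes: List[str]          # E: episode ids
    # T: (source episode, target episode) -> trigger predicate
    transitions: Dict[Tuple[str, str], Trigger] = field(default_factory=dict)

    def is_valid_transition(self, current: str, predicted: str, plot_list: PlotList) -> bool:
        """True if current -> predicted is an edge and its trigger fires on plot_list."""
        if predicted == current:
            # Assumption: remaining in the current episode needs no trigger.
            return True
        trigger = self.transitions.get((current, predicted))
        return trigger is not None and trigger(plot_list)


graph = EpisodeGraph(
    background={"story_name": "Demo"},
    episodes=["ep1", "ep2"],
    transitions={("ep1", "ep2"): lambda plots: any("key" in p for p in plots)},
)
print(graph.is_valid_transition("ep1", "ep2", ["The hero finds the key."]))
print(graph.is_valid_transition("ep1", "ep2", ["The hero rests."]))
```

The first call satisfies the trigger and the second does not, which mirrors the condition τ_{i→e_t}(P_t) = True.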

Training procedure

Algorithm — Diversity-aware Controllable RL

For the same episode state and user input, the policy samples a rollout group A_t = {a_t^(1), …, a_t^(G)} under identical graph constraints. Each candidate is scored along four channels:

R_acc — transition accuracy: whether the predicted next_episode matches the graph-derived reference successor. Directly supervisable from the episode graph; no human annotation required.
R_fmt — format / schema compliance: 0.4 · I(parseable JSON) + 0.6 · I(schema valid).
R_len — length control: rewards responses inside the target effective-length interval [L_min, L_max], with Gaussian decay outside.
R_div — DPP-style diversity: rewards semantically distinct continuations within a rollout group via the Schur-complement contribution of a regularized similarity kernel L = K + ηI on sentence-level embeddings.
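
The two rule-based channels, R_fmt and R_len, are simple enough to sketch directly. The following is a minimal illustration, not the released reward code: the schema check only inspects the two top-level keys, the in-interval reward of 1.0 is assumed, and `l_min`, `l_max`, and `sigma` are hypothetical parameters whose actual values are not stated here.

```python
import json
import math


def format_reward(text: str) -> float:
    """R_fmt = 0.4 * I(parseable JSON) + 0.6 * I(schema valid)."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    # Simplified schema check (assumption): only the top-level keys are inspected.
    schema_ok = (
        isinstance(obj, dict)
        and isinstance(obj.get("plot_list"), list)
        and isinstance(obj.get("next_episode"), str)
    )
    return 0.4 + (0.6 if schema_ok else 0.0)


def length_reward(length: int, l_min: int, l_max: int, sigma: float) -> float:
    """1.0 inside [l_min, l_max]; Gaussian decay in the distance to the nearest bound outside."""
    if l_min <= length <= l_max:
        return 1.0
    gap = (l_min - length) if length < l_min else (length - l_max)
    return math.exp(-(gap ** 2) / (2.0 * sigma ** 2))


print(format_reward('{"plot_list": [], "next_episode": "ep2"}'))
print(format_reward("[1, 2]"))
print(length_reward(170, l_min=50, l_max=150, sigma=20.0))
```

A response that parses but misses the schema keeps the partial credit of 0.4, so the policy still gets a signal for producing well-formed JSON.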

Each channel is group-normalized and aggregated:

R(i) = λ_div · ŝ_div(i) + λ_fmt · ŝ_fmt(i) + λ_acc · ŝ_acc(i) + λ_len · ŝ_len(i)

This decomposition separates semantic exploration (R_div) from structural control (R_acc, R_fmt, R_len), keeping rollouts compatible with the structured episode graph while still encouraging diverse continuations.
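
One possible reading of the diversity channel and the aggregation step is sketched below with NumPy. Several points are assumptions rather than confirmed details: K is taken to be a cosine-similarity kernel, the "Schur-complement contribution" of rollout i is taken to be its Schur complement against the rest of the group (which equals 1 / (L⁻¹)_ii), group normalization is taken to be a z-score, and the λ weights are placeholders.

```python
import numpy as np


def dpp_diversity_scores(embeddings: np.ndarray, eta: float = 0.1) -> np.ndarray:
    """Schur complement of each rollout against the rest of its group, with L = K + eta * I.

    For item i and remaining items S:
        L_ii - L_iS @ inv(L_SS) @ L_Si == 1 / inv(L)_ii
    """
    # Assumes non-zero embedding rows; K is a cosine-similarity kernel (assumption).
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kernel = unit @ unit.T
    L = kernel + eta * np.eye(len(unit))
    return 1.0 / np.diag(np.linalg.inv(L))


def group_normalize(scores: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Z-score within the rollout group (one assumed form of group normalization)."""
    return (scores - scores.mean()) / (scores.std() + eps)


def aggregate(channels: dict, weights: dict) -> np.ndarray:
    """R(i) = sum over channels c of lambda_c * normalized s_c(i)."""
    return sum(weights[c] * group_normalize(np.asarray(v, dtype=float))
               for c, v in channels.items())


# Two identical rollouts and one distinct rollout.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
div = dpp_diversity_scores(emb)
rewards = aggregate(
    {"div": div, "fmt": [1.0, 1.0, 0.4], "acc": [1.0, 0.0, 1.0], "len": [1.0, 1.0, 1.0]},
    {"div": 0.25, "fmt": 0.25, "acc": 0.25, "len": 0.25},  # placeholder weights
)
print(div, rewards)
```

In this toy group the distinct rollout receives the largest diversity score, while the two duplicates share a small one. A channel that is constant across the group (here `len`) normalizes to zero and does not affect the ranking.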

Hyperparameters

| Setting | Value |
|---|---|
| Base model | `Qwen/Qwen3-8B` |
| RL framework | EasyR1 (GRPO-style group rollouts) |
| Rollouts per context | 2 – 4 |
| Sampling | temperature 0.7, top-p 0.95 |
| KL penalty (vs. reference policy) | coefficient 0.02 |
| Learning rate | 5 × 10⁻⁶ |
| Precision | bf16 with gradient checkpointing |
| Max sequence length | 7,000 |
| Hardware | 2 × NVIDIA H200 |
| Wall-clock | ≈ 16 hours |

Training data

Trained on HeAAAAA/story_generation_rl: 22,233 turn-level interactive instances derived from 174 stories and 4,415 episodes, in English (74.1%) and Chinese (25.9%), covering normal, abnormal, and hacking-style user behaviors.

Evaluation results

All results below are reported on the EpisodeBench test set with the calibrated Qwen3-8B-normal evaluator (see the paper for the calibration protocol against 300 human-rated samples). Quality scores are judged on a 0–5 scale and normalized to 0–100; each quality cell reports rollout@1 / avg@7. Acc@1, Pass@7, and JSON@7 are graph- and schema-derived metrics that are independent of automatic narrative-quality judgments.
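
For readers who want to reproduce the shape of these numbers, the sketch below shows one conventional way to compute them. The definitions are assumptions based on common usage and on the "JSON Avg@K" column name: Acc@1 looks at the first rollout only, Pass@K counts a context as solved if any of its K rollouts is correct, and JSON@K averages validity over all rollouts. The released evaluation pipeline is the authoritative source, and all function names here are hypothetical.

```python
from typing import List


def normalize_quality(score_0_to_5: float) -> float:
    """Map a 0-5 judge score onto the 0-100 scale used in the tables."""
    return score_0_to_5 / 5.0 * 100.0


def acc_at_1(correct: List[List[bool]]) -> float:
    """Percentage of contexts whose first rollout predicts the reference successor."""
    return 100.0 * sum(r[0] for r in correct) / len(correct)


def pass_at_k(correct: List[List[bool]]) -> float:
    """Percentage of contexts where at least one rollout is correct."""
    return 100.0 * sum(any(r) for r in correct) / len(correct)


def json_avg_at_k(valid: List[List[bool]]) -> float:
    """Percentage of schema-valid responses over all rollouts of all contexts."""
    flat = [v for r in valid for v in r]
    return 100.0 * sum(flat) / len(flat)


# Four contexts with two rollouts each (toy data, not benchmark results).
correct = [[True, False], [False, True], [False, False], [True, True]]
print(acc_at_1(correct), pass_at_k(correct), json_avg_at_k(correct))
```

Because Pass@K only needs one correct rollout per context, it can move in the opposite direction from Acc@1 when a policy becomes more deterministic.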

Story generation quality (Qwen3-8B-normal evaluator)

| Model | Plot | Guide | Narr | Char | Trans | Total | Acc@1 | Pass@7 | JSON@7 |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B (base) | 78.83 / 79.13 | 73.50 / 73.78 | 87.97 / 88.31 | 86.15 / 86.16 | 80.67 / 80.97 | 82.54 / 82.74 | 70.75 | 90.53 | 93.49 |
| Qwen3-8B-SFT | 76.18 / 76.20 | 70.32 / 70.53 | 82.29 / 81.90 | 83.10 / 83.24 | 80.40 / 80.63 | 79.54 / 79.57 | 68.13 | 92.72 | 85.48 |
| Qwen3-8B-RL (this model) | 79.46 / 79.40 | 73.80 / 73.77 | 88.02 / 88.10 | 85.69 / 85.88 | 84.00 / 83.34 | 83.30 / 83.18 | 74.88 | 89.97 | 97.17 |
| Qwen3-8B-SFT-RL | 75.77 / 75.58 | 68.50 / 69.10 | 78.91 / 78.98 | 81.52 / 81.59 | 87.52 / 87.35 | 79.60 / 79.66 | 80.01 | 86.39 | 98.55 |

At rollout@1, Qwen3-8B-RL improves over the base model on Plot, Guide, Narr, and Trans (Char is slightly lower, 85.69 vs. 86.15) and on Acc@1 and JSON@7, achieving the highest Total among Qwen3-8B variants; Pass@7 is slightly lower (89.97 vs. 90.53). Compared with Qwen3-8B-SFT-RL, it preserves stronger surface-level narrative-quality scores while still substantially raising structured controllability metrics relative to the base model.

Statistical significance (vs. Qwen3-8B base)

Paired sign-flip permutation tests on continuous reward / JSON metrics; McNemar's test on accuracy / pass-rate metrics; Benjamini–Hochberg FDR-adjusted q-values (* q < 0.05, ** q < 0.01, *** q < 0.001).

| Evaluator | Avg@K Total | R1 Total | Acc@1 | Pass@K | JSON Avg@K |
|---|---|---|---|---|---|
| Qwen3-8B-normal | +0.44*** | +0.76** | +4.12*** | −0.56 | +3.68*** |
| Qwen3-8B-exppos | +0.47*** | +0.82*** | +4.12*** | −0.56 | +3.68*** |

Gains are statistically significant on Total quality, Acc@1, and JSON across both calibrated evaluators; the small decrease in Pass@K (−0.56) is not significant.
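
The three procedures named above can be sketched as follows. This is an illustrative implementation, not the paper's analysis script: the number of permutations, the random seed, and the choice of the exact (binomial) form of McNemar's test are assumptions, and the function names are hypothetical.

```python
from math import comb

import numpy as np


def sign_flip_pvalue(diffs, n_perm: int = 10000, seed: int = 0) -> float:
    """Two-sided paired sign-flip permutation test on per-item score differences."""
    diffs = np.asarray(diffs, dtype=float)
    rng = np.random.default_rng(seed)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # Add-one correction keeps the estimated p-value strictly positive.
    return float((1 + np.sum(permuted >= observed - 1e-12)) / (n_perm + 1))


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def benjamini_hochberg(pvalues) -> np.ndarray:
    """Benjamini-Hochberg FDR-adjusted q-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

The sign-flip test materializes an `n_perm × n` matrix, which is fine for test sets of a few thousand items but would need batching for much larger ones.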

Episode-transition timing

EpisodeBench measures whether the model advances within the intended pacing window. With the interaction budget T = 10, transitions are bucketed into too fast (< 80% of T), on time (80–100%), too slow (100–150%), and failure (> 150%):

| Model | Too fast | On time | Too slow | Failure |
|---|---|---|---|---|
| Qwen3-8B (base) | 86.13% | 4.96% | – | – |
| Qwen3-8B-RL | 59.41% | 11.87% | – | – |
| Qwen3-8B-SFT-RL | 55.04% | 13.00% | – | – |

Qwen3-8B-RL reduces premature transitions by ~27 percentage points and more than doubles the on-time rate.

Note: Even with these gains, on-time rates remain modest in absolute terms — pacing control under a fixed ten-turn contract is still an open challenge. See "Limitations".
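
The bucketing rule can be written as a short helper. This is a sketch under one assumption about the boundaries, which the text leaves open: exactly 100% of the budget counts as on time and exactly 150% counts as too slow. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def classify_transition(turn: int, budget: int = 10) -> str:
    """Bucket the turn at which the model advances, relative to the budget T."""
    # Integer arithmetic avoids floating-point edge cases at the boundaries.
    if 10 * turn < 8 * budget:
        return "too fast"
    if turn <= budget:
        return "on time"
    if 2 * turn <= 3 * budget:
        return "too slow"
    return "failure"


def bucket_rates(turns: Iterable[int], budget: int = 10) -> Dict[str, float]:
    """Percentage of transitions falling into each bucket."""
    counts = Counter(classify_transition(t, budget) for t in turns)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


print(bucket_rates([3, 5, 9, 10, 12, 16]))
```

With T = 10, the on-time window corresponds to advancing at turn 8, 9, or 10.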

Intended use

Use Qwen3-8B-RL as a Content Completer that:

Generates narrative content within an episode under a structured outline.
Predicts the correct next episode according to the episode graph.
Complies with a structured JSON output schema.
Paces the episode within the interaction budget T = 10, advancing in the late-episode window (80–100% of the budget, i.e. turns 8–10).

It is suitable for: research on long-form interactive storytelling, controllable role-playing, schema-guided generation, evaluator/RL studies, and as a baseline for further controllable-RL methods.

Inference

The model expects the EpisodeBench system prompt encoding the fixed structured outline (background B, episode goal G_goal, scene S_scene, valid triggers τ_i, interaction budget T = 10). The expected assistant output is a JSON object with plot_list and next_episode. See the codebase for the canonical prompt and parser.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HeAAAAA/story_generation_Qwen3_8B_RL"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

system_prompt = "<EpisodeBench structured outline: B, G_goal, S_scene, τ_i, T=10>"
user_input    = "<u_t: free-form user utterance>"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": user_input},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

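# do_sample=True ensures temperature/top_p take effect regardless of the checkpoint's default generation config
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.95)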
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

The expected response schema is:

{
  "plot_list": [
    {
      "narrative": "<scene/action narration>",
      "role_dialogue": [
        {"name": "<character>", "utterance": "<line>"}
      ]
    }
  ],
  "next_episode": "<predicted successor episode id>"
}
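
A response can be checked against this schema before it is used downstream. The validator below is a minimal sketch, not the canonical parser from the codebase: it assumes `plot_list` must be non-empty and that `role_dialogue` is optional, and the name `validate_response` is hypothetical.

```python
import json
from typing import Any, Dict


def validate_response(text: str) -> Dict[str, Any]:
    """Parse a model response and check it against the schema shown above.

    Raises ValueError with a short reason if the response is malformed.
    """
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not parseable JSON: {exc.msg}") from exc
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be an object")
    plots = obj.get("plot_list")
    if not isinstance(plots, list) or not plots:
        raise ValueError("plot_list must be a non-empty list")  # non-empty is an assumption
    for item in plots:
        if not isinstance(item, dict) or not isinstance(item.get("narrative"), str):
            raise ValueError("each plot item needs a string 'narrative'")
        dialogue = item.get("role_dialogue", [])  # treated as optional (assumption)
        if not isinstance(dialogue, list):
            raise ValueError("role_dialogue must be a list")
        for line in dialogue:
            if not (isinstance(line, dict)
                    and isinstance(line.get("name"), str)
                    and isinstance(line.get("utterance"), str)):
                raise ValueError("role_dialogue entries need 'name' and 'utterance'")
    if not isinstance(obj.get("next_episode"), str):
        raise ValueError("next_episode must be a string")
    return obj


sample = '{"plot_list": [{"narrative": "Rain falls.", "role_dialogue": []}], "next_episode": "ep2"}'
print(validate_response(sample)["next_episode"])
```

Raising a descriptive error instead of returning a boolean makes it easier to log why a rollout failed the schema check.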

For full rollout-style evaluation (Acc@1, Pass@7, JSON@7, transition-timing analysis) use the released evaluation pipeline in the GitHub repository.

Limitations

The model is trained on synthetic, source-guided trajectories under a fixed pacing contract (T = 10). It is optimized for structured progression control, not unconstrained literary creativity.
Improvements driven by EpisodeBench-derived RL are most pronounced on transition, pacing, and schema-following metrics; gains on subjective narrative-quality dimensions (Plot, Guidance, Narration, Character) are smaller and more evaluator-dependent.
On-time transition rate, while substantially improved over the base model, remains modest in absolute terms. Long-horizon pacing control is an open problem.
We do not systematically study adversarial or safety-relevant prompts. Lightweight, data-level heuristic corrections were applied to ill-formed user inputs during training, but dedicated safety mechanisms are out of scope.
Evaluator-based narrative-quality scores should be read as ranking-level, not absolute-score, measurements (cell-level Spearman ρ ≈ 0.78 vs. human ratings under Qwen3-8B-normal).

License

Released under CC BY 4.0 for research use with attribution. The base model is subject to its own license; please consult Qwen/Qwen3-8B for the upstream terms.

This model is intended for research on structured narrative progression, episode-transition control, pacing analysis, evaluator calibration, and controllable long-form generation. It is not intended for reconstructing original copyrighted stories, evaluating general literary merit, or deploying unrestricted interactive storytelling systems. Commercial use and redistribution of generated content for interactive deployment require additional review.

Citation

If you use this model, please cite the EpisodeBench paper:

@inproceedings{episodebench2026,
  title     = {EpisodeBench: A Full-Cycle Benchmarking Pipeline for Long-form Interactive Story Generation with Controllable RL},
  author    = {Anonymous},
  booktitle = {XX},
  year      = {2026},
  url       = {https://github.com/KaiHe-better/Longform_Interactive_Story_Generation}
}

Related releases

| Resource | Type | Link |
|---|---|---|
| Story Generation SFT | Dataset (episode-packed) | `HeAAAAA/story_generation_sft` |
| Story Generation RL | Dataset (turn-level) | `HeAAAAA/story_generation_rl` |
| Qwen3-8B-RL (this model) | Generator (RL only) | `HeAAAAA/story_generation_Qwen3_8B_RL` |
| Reward Train: {Expneg, Exppos, Normal, Uniform} | Judge training | – |
| Reward Test | Judge testing | – |
| Human Ratings | Evaluator calibration (300 items) | – |
