ToolMerge planner — GRPO-finetuned Qwen3-VL-8B (step 50)

GRPO-finetuned planner from Qwen3-VL-8B-Instruct, used as the text-only query decomposer in the ToolMerge keyframe-retrieval pipeline.

Trained with TRL's GRPO trainer on Molmo-2 Moments (M2M) training data, optimizing the frames-in-GT + consistency reward at global_step=50.

Quick start

from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("michalsr/toolmerge-planner-grpo")
model = AutoModelForCausalLM.from_pretrained(
    "michalsr/toolmerge-planner-grpo",
    torch_dtype="bfloat16",
)

To use inside ToolMerge, override the planner checkpoint at the CLI:

toolmerge config=configs/m2m/qwen3_8.yaml \
    model.base=michalsr/toolmerge-planner-grpo

Training recipe

Setting Value
Base model Qwen/Qwen3-VL-8B-Instruct
Reward frames_in_gt=1.0, consistency=1.0
Training data train_correct_uniform_8f_clip_max1.json (filtered M2M train split, ~1500 items)
Optimizer paged_adamw_8bit, lr=1e-6, bf16
Compute 2 nodes × 4 GPUs
Step global_step=50
Framework TRL 0.27.2, transformers 4.57.6, PyTorch 2.10.0

Full training config: training/configs/m2m_grpo.yaml in the ToolMerge repo.

Citation

@inproceedings{toolmerge2026,
  title     = {Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval},
  author    = {TODO},
  booktitle = {TODO},
  year      = {2026},
}

Cite the GRPO method:

@article{shao2024deepseekmath,
    title        = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author       = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year         = 2024,
    eprint       = {arXiv:2402.03300},
}

Code repo: https://github.com/michalsr/ToolMerge.

Downloads last month
9
Safetensors
Model size
9B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for michalsr/toolmerge-planner-grpo

Finetuned
(278)
this model

Paper for michalsr/toolmerge-planner-grpo