MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models Paper • 2401.16745 • Published Jan 30, 2024 • 1
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models Paper • 2310.20410 • Published Oct 31, 2023
M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models Paper • 2310.19240 • Published Oct 30, 2023
ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction Paper • 2508.12685 • Published Aug 18, 2025 • 1
ToolACE-MCP: Generalizing History-Aware Routing from MCP Tools to the Agent Web Paper • 2601.08276 • Published Jan 13 • 7
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL Paper • 2605.18703 • Published 4 days ago • 46
Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents Paper • 2512.20092 • Published Dec 23, 2025 • 9