QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
Abstract
A multimodal social reasoning environment and evaluation framework called QUACK is introduced to audit the grounding of agent language through three-level assessment of game outcomes, behavioral trajectories, and utterance-level consistency.
Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely restricted to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier vision-language models (VLMs) in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
Community
Hi everyone! This is Ye Yuan from McGill University and Mila. I am excited to share our recent paper: QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents.
Social deduction games have become a popular testbed for evaluating reasoning, deception, coordination, and belief modeling in large language models. However, most environments evaluate agents primarily through game outcomes like win rates and largely remain restricted to text-only interactions. Because of this, it is difficult to determine whether an agent's language is actually grounded in what it perceived and did, or to systematically identify reasoning failures.
To address this gap, we built QUACK, an open-source environment and evaluation framework designed to audit the grounding of agent language in multimodal social reasoning.
Here is a quick overview of how QUACK works:
- Multimodal & Partially Observable: Agents navigate configurable graph-based maps, complete location-bound tasks, and communicate under hidden-role adversarial incentives. They observe both rendered global and local views.
- Fully Replayable Ground Truth: Every episode is serialized into structured engine-level event logs, yielding a tick-by-tick ground-truth trajectory for each agent.
- Three-Tier Evaluation: We score agents at three levels: game outcomes (Tier 1), behavioral trajectories (Tier 2), and utterance-level consistency (Tier 3).
- Statement Verification Pipeline: At the core of Tier 3, this pipeline reconstructs the ground-truth trajectory and checks every single discussion claim against the reconstructed world state.
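To make the verification idea concrete, here is a minimal Python sketch of checking one spatial claim against a trajectory rebuilt from an event log. It is not QUACK's actual implementation: the event schema (`tick`, `agent`, `location`), the `SpatialClaim` type, the helper names, and the three verdict labels are all assumptions chosen for illustration, and the room names are made up.

```python
from bisect import bisect_right
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialClaim:
    """Hypothetical parsed claim: 'subject was at location between two ticks'."""
    subject: str
    location: str
    start_tick: int
    end_tick: int


def reconstruct_trajectories(events):
    """Group location events by agent, ordered by tick (assumed event schema)."""
    trajectories = defaultdict(list)
    for event in sorted(events, key=lambda e: e["tick"]):
        trajectories[event["agent"]].append((event["tick"], event["location"]))
    return dict(trajectories)


def locations_during(trajectory, start_tick, end_tick):
    """Return every location occupied at some tick in [start_tick, end_tick]."""
    ticks = [tick for tick, _ in trajectory]
    visited = set()
    # The last event at or before start_tick gives the position at the window start.
    i = bisect_right(ticks, start_tick) - 1
    if i >= 0:
        visited.add(trajectory[i][1])
    # Add every location entered later inside the window.
    for tick, location in trajectory[i + 1:]:
        if tick > end_tick:
            break
        visited.add(location)
    return visited


def check_claim(claim, trajectories):
    """Label a claim as supported, contradicted, or unverifiable."""
    trajectory = trajectories.get(claim.subject)
    if not trajectory:
        return "unverifiable"
    visited = locations_during(trajectory, claim.start_tick, claim.end_tick)
    if not visited:
        return "unverifiable"
    return "supported" if claim.location in visited else "contradicted"


# Toy log with made-up agents and rooms.
events = [
    {"tick": 0, "agent": "A", "location": "Kitchen"},
    {"tick": 5, "agent": "A", "location": "Lab"},
    {"tick": 9, "agent": "A", "location": "Dock"},
]
trajectories = reconstruct_trajectories(events)
print(check_claim(SpatialClaim("A", "Lab", 3, 6), trajectories))
print(check_claim(SpatialClaim("A", "Dock", 3, 6), trajectories))
```

In this sketch a claim counts as supported if the subject occupied the claimed location at any tick in the window; a stricter variant could require the subject to stay there for the whole window.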
We evaluated three frontier VLMs (GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7) across 270 games in both homogeneous and cross-model adversarial settings. The audit revealed that relying on win rates masks systematic reasoning failures. Specifically:
- Spatial Hallucination: Even the strongest agent hallucinates 15.1% of its verifiable spatial claims.
- Unsupported Accusations: Agents make over half (53.5%) of their accusations without any grounded supporting evidence.
- Deception Collapse: When playing as the impostor (Duck), models exhibit a deception rate of 22.1%, so roughly a fifth of their verifiable claims are outright false. Their deception sophistication is also near zero: they produce easily falsifiable lies rather than subtle alibis.
- Language-Action Inconsistency: Agents frequently state activities or routes that directly conflict with their logged actions.
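As a rough illustration of how rates like these could be aggregated from per-claim verdicts, the Python sketch below computes three ratios over a toy list of records. The record fields (`role`, `kind`, `verdict`), the role label `non_duck`, and the exact numerators and denominators are assumptions for illustration; the paper's own metric definitions may differ, and the toy numbers are not results from the paper.

```python
def rate(records, is_counted, is_hit):
    """Fraction of counted records that are hits, or None if nothing is counted."""
    counted = [r for r in records if is_counted(r)]
    if not counted:
        return None
    return sum(1 for r in counted if is_hit(r)) / len(counted)


# Toy verdict records with a hypothetical schema.
records = [
    {"role": "non_duck", "kind": "spatial", "verdict": "supported"},
    {"role": "non_duck", "kind": "spatial", "verdict": "contradicted"},
    {"role": "non_duck", "kind": "spatial", "verdict": "unverifiable"},
    {"role": "non_duck", "kind": "accusation", "verdict": "unsupported"},
    {"role": "non_duck", "kind": "accusation", "verdict": "supported"},
    {"role": "duck", "kind": "spatial", "verdict": "contradicted"},
    {"role": "duck", "kind": "spatial", "verdict": "supported"},
    {"role": "duck", "kind": "spatial", "verdict": "supported"},
    {"role": "duck", "kind": "spatial", "verdict": "supported"},
]


def verifiable(r):
    """A claim is verifiable if the log could confirm or refute it."""
    return r["verdict"] != "unverifiable"


# Assumed definition: contradicted share of verifiable spatial claims by non-Ducks.
spatial_hallucination = rate(
    records,
    lambda r: r["role"] == "non_duck" and r["kind"] == "spatial" and verifiable(r),
    lambda r: r["verdict"] == "contradicted",
)

# Assumed definition: share of accusations lacking grounded evidence.
unsupported_accusation = rate(
    records,
    lambda r: r["kind"] == "accusation",
    lambda r: r["verdict"] == "unsupported",
)

# Assumed definition: contradicted share of the Duck's verifiable claims.
deception = rate(
    records,
    lambda r: r["role"] == "duck" and verifiable(r),
    lambda r: r["verdict"] == "contradicted",
)

print(spatial_hallucination, unsupported_accusation, deception)
```

Excluding unverifiable claims from the denominators follows the post's wording ("verifiable spatial claims", "verifiable claims"); whether the Duck's false claims also count toward spatial hallucination is left as an open assumption here.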
We have released the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK, and the raw logs at https://huggingface.co/datasets/5a-academia-attractions/QUACK.
I would love to hear the community's thoughts on this! Happy to answer any questions about the environment design or the verification pipeline!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems (2026)
- Revac: A Social Deduction Reasoning Agent (2026)
- PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments (2026)
- TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs (2026)
- Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games (2026)
- Hallucination as Exploit: Evidence-Carrying Multimodal Agents (2026)
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents (2026)
Get this paper in your agent:
hf papers read 2605.27068
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash