arxiv:2605.27068

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Published on May 26

Submitted by Ye Yuan on May 27

AAAAA Academia Attractions

Authors:

Ye Yuan ,

Abstract

A multimodal social reasoning environment and evaluation framework called QUACK is introduced to audit the grounding of agent language through three-level assessment of game outcomes, behavioral trajectories, and utterance-level consistency.

AI-generated summary

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely restricted to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.

Community

stevenyuan666

Paper author Paper submitter about 13 hours ago

Hi everyone! This is Ye Yuan from McGill University and Mila. I am excited to share our recent paper: QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents.

Social deduction games have become a popular testbed for evaluating reasoning, deception, coordination, and belief modeling in large language models. However, most environments evaluate agents primarily through game outcomes like win rates and largely remain restricted to text-only interactions. Because of this, it is difficult to determine whether an agent's language is actually grounded in what it perceived and did, or to systematically identify reasoning failures.

To address this gap, we built QUACK, an open-source environment and evaluation framework designed to audit the grounding of agent language in multimodal social reasoning.

Here is a quick overview of how QUACK works:

Multimodal & Partially Observable: Agents navigate configurable graph-based maps, complete location-bound tasks, and communicate under hidden-role adversarial incentives. They observe both rendered global and local views.
Fully Replayable Ground Truth: Every episode is serialized into structured engine-level event logs, yielding a tick-by-tick ground-truth trajectory for each agent.
Three-Tier Evaluation: We score agents at three levels: game outcomes (Tier 1), behavioral trajectories (Tier 2), and utterance-level consistency (Tier 3).
Statement Verification Pipeline: At the core of Tier 3, this pipeline reconstructs the ground-truth trajectory and checks every single discussion claim against the reconstructed world state.
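
To make the Tier 3 idea concrete, here is a minimal sketch of how a spatial claim might be checked against a trajectory rebuilt from event logs. The event schema (`tick`, `agent`, `room`), the `SpatialClaim` type, the verdict labels, and all function names are assumptions made for illustration; they are not QUACK's actual API, and the real pipeline covers more claim types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoveEvent:
    """Hypothetical log record: `agent` entered `room` at `tick`."""
    tick: int
    agent: str
    room: str

@dataclass(frozen=True)
class SpatialClaim:
    """Hypothetical parsed claim: `subject` was in `room` during a tick window."""
    speaker: str
    subject: str
    room: str
    start_tick: int
    end_tick: int

def build_trajectory(events, agent):
    """Return the agent's (tick, room) pairs sorted by tick."""
    return sorted((e.tick, e.room) for e in events if e.agent == agent)

def rooms_in_window(trajectory, start, end):
    """Rooms the agent occupied at any point in the window [start, end]."""
    rooms = set()
    current = None  # last room entered at or before `start`
    for tick, room in trajectory:
        if tick <= start:
            current = room
        elif tick <= end:
            rooms.add(room)
    if current is not None:
        rooms.add(current)
    return rooms

def verify_claim(events, claim):
    """Label a claim as supported, contradicted, or unverifiable."""
    trajectory = build_trajectory(events, claim.subject)
    rooms = rooms_in_window(trajectory, claim.start_tick, claim.end_tick)
    if not rooms:
        return "unverifiable"  # no logged position for the subject
    return "supported" if claim.room in rooms else "contradicted"

log = [
    MoveEvent(0, "A", "Lobby"),
    MoveEvent(5, "A", "Lab"),
    MoveEvent(9, "A", "Kitchen"),
]
print(verify_claim(log, SpatialClaim("B", "A", "Lab", 4, 7)))
print(verify_claim(log, SpatialClaim("B", "A", "Kitchen", 4, 7)))
```

In this sketch a claim about an agent with no logged position is kept separate as "unverifiable" rather than counted as false, which matches the post's framing of rates over verifiable claims.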

We evaluated three frontier VLMs (GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7) across 270 games in both homogeneous and cross-model adversarial settings. The audit revealed that relying on win rates masks systematic reasoning failures. Specifically:

Spatial Hallucination: Even the strongest agent hallucinates 15.1% of its verifiable spatial claims.
Unsupported Accusations: Agents make over half (53.5%) of their accusations without any grounded supporting evidence.
Deception Collapse: When playing as the impostor (Duck), models exhibit a deception rate of 22.1%, meaning roughly a fifth of their verifiable claims are outright false. Furthermore, their deception sophistication is near zero, meaning they produce easily falsifiable lies rather than subtle alibis.
Language-Action Inconsistency: Agents frequently state activities or routes that directly conflict with their logged actions.
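
The percentages above are rates over verifiable claims and over accusations. A minimal sketch of how such rates could be aggregated from per-claim verdicts follows; the verdict labels and the boolean "grounded" flag are assumptions for illustration, and the paper's exact metric definitions may differ.

```python
from collections import Counter

def contradiction_rate(verdicts):
    """Share of verifiable claims that the ground truth contradicts.

    `verdicts` is a list of hypothetical labels: "supported",
    "contradicted", or "unverifiable". Unverifiable claims are
    excluded from the denominator.
    """
    counts = Counter(verdicts)
    verifiable = counts["supported"] + counts["contradicted"]
    if verifiable == 0:
        return None  # nothing to measure
    return counts["contradicted"] / verifiable

def unsupported_accusation_rate(grounded_flags):
    """Share of accusations made without grounded evidence.

    `grounded_flags` holds one bool per accusation: True if the
    accuser had logged evidence (an assumed definition).
    """
    if not grounded_flags:
        return None
    unsupported = sum(1 for grounded in grounded_flags if not grounded)
    return unsupported / len(grounded_flags)

print(contradiction_rate(["supported", "contradicted", "unverifiable", "supported"]))
print(unsupported_accusation_rate([True, False, False, True]))
```

Applied to a speaker in the impostor role, the first function would correspond to something like the deception rate; applied to spatial claims from any role, to something like the hallucination rate.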

We have released the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK, with the raw logs also available at https://huggingface.co/datasets/5a-academia-attractions/QUACK.

I would love to hear the community's thoughts on this! Happy to answer any questions about the environment design or the verification pipeline!