arxiv:2605.27068

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Published on May 26 · Submitted by Ye Yuan on May 27

Abstract

A multimodal social reasoning environment and evaluation framework called QUACK is introduced to audit the grounding of agent language through three-level assessment of game outcomes, behavioral trajectories, and utterance-level consistency.

AI-generated summary

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely restricted to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.

Community


Hi everyone! This is Ye Yuan from McGill University and Mila. I am excited to share our recent paper: QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents.

Social deduction games have become a popular testbed for evaluating reasoning, deception, coordination, and belief modeling in large language models. However, most environments evaluate agents primarily through game outcomes like win rates and largely remain restricted to text-only interactions. Because of this, it is difficult to determine whether an agent's language is actually grounded in what it perceived and did, or to systematically identify reasoning failures.

To address this gap, we built QUACK, an open-source environment and evaluation framework designed to audit the grounding of agent language in multimodal social reasoning.

Here is a quick overview of how QUACK works:

  • Multimodal & Partially Observable: Agents navigate configurable graph-based maps, complete location-bound tasks, and communicate under hidden-role adversarial incentives. They observe both rendered global and local views.
  • Fully Replayable Ground Truth: Every episode is serialized into structured engine-level event logs, yielding a tick-by-tick ground-truth trajectory for each agent.
  • Three-Tier Evaluation: We score agents at three levels: game outcomes (Tier 1), behavioral trajectories (Tier 2), and utterance-level consistency (Tier 3).
  • Statement Verification Pipeline: At the core of Tier 3, this pipeline reconstructs the ground-truth trajectory and checks every single discussion claim against the reconstructed world state.
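
To make the idea of checking a claim against a reconstructed trajectory concrete, here is a minimal Python sketch. It is not QUACK's actual code or log schema: `MoveEvent`, `SpatialClaim`, `reconstruct_trajectory`, and `verify_claim` are hypothetical names, and the event format (one record per room change) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveEvent:
    """Hypothetical engine log record: `agent` entered `room` at `tick`."""
    tick: int
    agent: str
    room: str


@dataclass(frozen=True)
class SpatialClaim:
    """Hypothetical parsed claim: speaker says they were in `room` during a tick window."""
    speaker: str
    room: str
    start_tick: int
    end_tick: int


def reconstruct_trajectory(events, agent, last_tick):
    """Return a list whose index t holds the agent's room at tick t (None before the first event)."""
    moves = sorted((e for e in events if e.agent == agent), key=lambda e: e.tick)
    trajectory = []
    current = None
    i = 0
    for t in range(last_tick + 1):
        # Apply every move that has happened up to and including tick t.
        while i < len(moves) and moves[i].tick <= t:
            current = moves[i].room
            i += 1
        trajectory.append(current)
    return trajectory


def verify_claim(claim, trajectory):
    """A claim counts as grounded if the speaker was in the claimed room at any tick in the window."""
    window = trajectory[claim.start_tick:claim.end_tick + 1]
    return claim.room in window


events = [
    MoveEvent(0, "A", "Lab"),
    MoveEvent(3, "A", "Hall"),
    MoveEvent(5, "A", "Kitchen"),
    MoveEvent(0, "B", "Hall"),
]
traj_a = reconstruct_trajectory(events, "A", last_tick=6)

grounded = verify_claim(SpatialClaim("A", "Kitchen", 4, 6), traj_a)
hallucinated = not verify_claim(SpatialClaim("A", "Lab", 4, 6), traj_a)
print(traj_a, grounded, hallucinated)
```

In this toy run, agent A's claim to have been in the Kitchen during ticks 4 to 6 is supported by the log, while a claim to have been in the Lab during the same window would be flagged as a spatial hallucination.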

We evaluated three frontier VLMs (GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7) across 270 games in both homogeneous and cross-model adversarial settings. The audit revealed that relying on win rates masks systematic reasoning failures. Specifically:

  • Spatial Hallucination: Even the strongest agent hallucinates 15.1% of its verifiable spatial claims.
  • Unsupported Accusations: Agents make over half (53.5%) of their accusations without any grounded supporting evidence.
  • Deception Collapse: When playing as the impostor (Duck), models exhibit a deception rate of 22.1%, meaning roughly a fifth of their verifiable claims are outright false. Furthermore, their deception sophistication is near zero, meaning they produce easily falsifiable lies rather than subtle alibis.
  • Language-Action Inconsistency: Agents frequently state activities or routes that directly conflict with their logged actions.
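
As a rough illustration of how rates like these can be read, the sketch below computes the share of verifiable claims that are contradicted by the logs. It assumes, based on the wording "of their verifiable claims" above, that unverifiable claims are excluded from the denominator; the paper's exact metric definitions may differ, and `false_claim_rate` is a hypothetical helper, not part of the released toolkit.

```python
def false_claim_rate(verdicts):
    """Fraction of verifiable claims that were contradicted by the ground truth.

    verdicts: iterable of True (grounded), False (contradicted),
    or None (claim could not be verified and is excluded).
    Returns None when no claim is verifiable.
    """
    verifiable = [v for v in verdicts if v is not None]
    if not verifiable:
        return None
    contradicted = sum(1 for v in verifiable if v is False)
    return contradicted / len(verifiable)


# Toy data: 8 claims, of which 6 are verifiable and 2 of those are contradicted.
verdicts = [True, False, None, True, True, None, False, True]
rate = false_claim_rate(verdicts)
print(f"{rate:.1%}")
```

Under this reading, a 15.1% spatial hallucination rate would mean that about 15 of every 100 spatial claims that could be checked against the logs turned out to be false.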

We have released the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK and raw logs at https://huggingface.co/datasets/5a-academia-attractions/QUACK.

I would love to hear the community's thoughts on this! Happy to answer any questions about the environment design or the verification pipeline!

