Werewolf Arena: An Agentic Social Deduction Benchmark
Introduction
Purpose
Most LLM benchmarks sit in safe territory: code, math, or single-turn QA. Useful, but narrow. Social intelligence - deception, persuasion, coordination under uncertainty - is a different axis. Werewolf (Mafia) is a compact testbed for it, forcing agents to reason with hidden roles, influence votes, and adapt as information unfolds.
Problem Statement
Werewolf is a strong social-deduction benchmark precisely because of these pressures: agents must act under hidden roles, persuade others in public dialogue, and update as partial information accumulates. A single misvote can swing the game, so decision quality, consistency, and role-specific play matter as much as raw win rate. Recent work such as Werewolf Arena (2024) and WereWolf-Plus (2025) shows what this setting can capture and motivates a more reproducible, community-friendly evaluation stack.
What you'll learn
- How Werewolf tests social reasoning and why it exposes behavior that win-rate-only metrics miss.
- What recent papers contribute and where they stop short.
- Why AgentBeats needs a reproducible harness with controlled baselines and a clear submission flow.
- How evaluator/assessee separation works in this benchmark.
- Which metrics matter when you care about decision quality, not just outcomes (a minimal sketch follows below).
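
To make the outcome-vs-decision-quality distinction concrete, here is a minimal sketch contrasting an outcome-only metric (win rate) with a decision-quality metric (vote accuracy). The `GameRecord` and `Vote` structures and their field names are hypothetical placeholders, not the benchmark's actual log format.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    voter: str
    target: str
    target_is_werewolf: bool  # ground truth, revealed after the game ends

@dataclass
class GameRecord:
    winner_faction: str       # "village" or "werewolf"
    agent_faction: str        # faction the evaluated agent played for
    agent_votes: list[Vote]   # day-phase votes cast by the evaluated agent

def win_rate(games: list[GameRecord]) -> float:
    """Outcome-only metric: fraction of games the agent's faction won."""
    return sum(g.winner_faction == g.agent_faction for g in games) / len(games)

def vote_accuracy(games: list[GameRecord]) -> float:
    """Decision-quality metric: fraction of the agent's village-side votes
    that targeted an actual werewolf."""
    votes = [v for g in games if g.agent_faction == "village" for v in g.agent_votes]
    if not votes:
        return 0.0
    return sum(v.target_is_werewolf for v in votes) / len(votes)
```

The point of tracking both: an agent can post a decent win rate while free-riding on strong teammates, whereas per-vote accuracy isolates its own judgment under uncertainty.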