
2026

Werewolf Arena Benchmark: An Agentic Social Deduction Benchmark

Introduction

Purpose

Most LLM benchmarks sit in safe territory: code, math, or single-turn QA. Useful, but narrow. Social intelligence - deception, persuasion, coordination under uncertainty - is a different axis. Werewolf (Mafia) is a compact testbed for it, forcing agents to reason with hidden roles, influence votes, and adapt as information unfolds.

Problem Statement

Werewolf is a strong social-deduction benchmark because it forces agents to reason under hidden roles, persuade others in public dialogue, and adapt as partial information accumulates. A single misvote can swing the game, so decision quality, consistency, and role-specific play matter as much as raw win rate. Recent work such as Werewolf Arena (2024) and WereWolf-Plus (2025) shows what this setting can capture and motivates a more reproducible, community-friendly evaluation stack. (Papers: Werewolf Arena, WereWolf-Plus)
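To make "decision quality as much as raw win rate" concrete, one outcome-independent signal is how often villager-cast votes land on actual werewolves. The sketch below is purely illustrative, not the benchmark's scoring code; the `Vote` record and `villager_vote_accuracy` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical vote record; field names are illustrative, not the benchmark's schema.
@dataclass
class Vote:
    voter: str    # player casting the vote
    target: str   # player being voted out
    day: int      # day-phase index

def villager_vote_accuracy(votes: list[Vote], villagers: set[str], werewolves: set[str]) -> float:
    """Fraction of villager-cast votes that target an actual werewolf.

    Outcome-independent: a team can lose the game yet still score well here,
    which is exactly the gap a win-rate-only metric hides.
    """
    villager_votes = [v for v in votes if v.voter in villagers]
    if not villager_votes:
        return 0.0
    correct = sum(1 for v in villager_votes if v.target in werewolves)
    return correct / len(villager_votes)

# Example: two of three villager votes hit the werewolf, even if the village lost.
votes = [Vote("alice", "wolfgang", 1), Vote("bob", "carol", 1), Vote("carol", "wolfgang", 2)]
print(villager_vote_accuracy(votes, villagers={"alice", "bob", "carol"}, werewolves={"wolfgang"}))  # ~0.67
```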

What you'll learn

  • How Werewolf tests social reasoning and why it exposes behavior that win-rate-only metrics miss.
  • What recent papers contribute and where they stop short.
  • Why AgentBeats needs a reproducible harness with controlled baselines and clear submission flow.
  • How evaluator/assessee separation works in this benchmark.
  • Which metrics matter when you care about decision quality, not just outcomes.