
Werewolf Arena Benchmark: An Agentic Social Deduction Benchmark

Introduction

Purpose

Most LLM benchmarks sit in safe territory: code, math, or single-turn QA. Useful, but narrow. Social intelligence - deception, persuasion, coordination under uncertainty - is a different axis. Werewolf (Mafia) is a compact testbed for it, forcing agents to reason with hidden roles, influence votes, and adapt as information unfolds.

Problem Statement

Werewolf is a strong social-deduction benchmark because it forces agents to reason under hidden roles, persuade others in public dialogue, and adapt as partial information accumulates. A single misvote can swing the game, so decision quality, consistency, and role-specific play matter as much as raw win rate. Recent work like Werewolf Arena (2024) and WereWolf-Plus (2025) shows what this benchmark can capture and motivates a more reproducible, community-friendly evaluation stack. (Papers: Werewolf Arena, WereWolf-Plus)

What you'll learn

  • How Werewolf tests social reasoning and why it exposes behavior that win-rate-only metrics miss.
  • What recent papers contribute and where they stop short.
  • Why AgentBeats needs a reproducible harness with controlled baselines and clear submission flow.
  • How evaluator/assessee separation works in this benchmark.
  • Which metrics matter when you care about decision quality, not just outcomes (a minimal metric sketch follows this list).
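One decision-quality metric of that kind is vote accuracy: how often a player's votes land on actual werewolves rather than innocents. The sketch below illustrates the idea under stated assumptions; the `Round` record and `vote_accuracy` helper are hypothetical, not part of the benchmark's actual log schema.

```python
from dataclasses import dataclass

@dataclass
class Round:
    # Hypothetical per-round record: each living player's vote target,
    # plus the ground-truth werewolf set known after the game ends.
    votes: dict[str, str]   # voter -> vote target
    werewolves: set[str]    # true werewolf players

def vote_accuracy(rounds: list[Round], player: str) -> float:
    """Fraction of this player's votes that targeted a true werewolf.

    A signal that win rate alone hides: a villager can win a game while
    consistently misvoting, or lose one while voting correctly every round.
    """
    correct = total = 0
    for r in rounds:
        target = r.votes.get(player)
        if target is None:
            continue  # player was dead or abstained this round
        total += 1
        correct += target in r.werewolves
    return correct / total if total else 0.0

# Example: alice voted for bob, who really was a werewolf.
rounds = [Round(votes={"alice": "bob", "carol": "dave"}, werewolves={"bob"})]
print(vote_accuracy(rounds, "alice"))  # 1.0
```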

BeatDebate — Technical Deep Dive

Introduction

Purpose

BeatDebate is a proof-of-concept web app that shows how a large-language-model (LLM) planner can orchestrate specialised agents to deliver transparent, long-tail music recommendations in under seven seconds.

Check out the project GitHub repository for the full code and detailed documentation. Here is the web application. Check out the AgentX course.

Problem Statement

Standard collaborative-filtering pipelines optimise clicks but amplify popularity bias and tell listeners nothing about why a song appears. BeatDebate flips the workflow: first an LLM writes an explicit machine-readable plan, then lightweight agents execute it, and finally a Judge agent converts plan weights into human-readable explanations.
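To make that flow concrete, here is a minimal sketch under stated assumptions: the plan is a plain dict and the Judge turns its weights into one sentence. The field names (`intent`, `weights`, `constraints`) are illustrative, not BeatDebate's actual plan schema, which lives in the repository.

```python
# Hypothetical machine-readable plan written by the LLM planner.
plan = {
    "intent": "mellow late-night jazz, lean toward lesser-known artists",
    "weights": {"genre_mood": 0.4, "discovery": 0.6},
    "constraints": {"max_popularity_percentile": 30},
}

def judge_explanation(plan: dict, track: dict) -> str:
    """Convert plan weights into a human-readable reason for one pick."""
    w = plan["weights"]
    lead = "discovery" if w["discovery"] >= w["genre_mood"] else "genre_mood"
    pct = plan["constraints"]["max_popularity_percentile"]
    return (
        f"Recommended '{track['title']}' because the plan weighted {lead} "
        f"at {w[lead]:.0%} and capped popularity at the {pct}th percentile."
    )

print(judge_explanation(plan, {"title": "Example Track"}))
```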

What you’ll learn

  • Designing an LLM-planned recommender — externalising reasoning as JSON so downstream agents become cheap and debuggable.
  • Using LangGraph for agent orchestration — a typed DAG with retries, time-outs, and state-passing (see the orchestration sketch after this list).
  • Balancing novelty and relevance with dual advocate agents (Genre-Mood vs. Discovery).
  • Generating explanations by design rather than post-hoc.
  • Running at interactive speed on commodity hardware for <$0.04 per query.
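Below is a minimal LangGraph sketch of the planner, advocates, and judge wiring described above, with stubbed node bodies and illustrative state fields; BeatDebate's real nodes, retry logic, and time-out handling are in the repository.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class RecState(TypedDict, total=False):
    query: str
    plan: dict
    candidates: list[dict]
    explanation: str

def planner(state: RecState) -> RecState:
    # In BeatDebate an LLM call writes the JSON plan; stubbed here.
    return {"plan": {"weights": {"genre_mood": 0.4, "discovery": 0.6}}}

def advocates(state: RecState) -> RecState:
    # Genre-Mood and Discovery agents would each score candidates here.
    return {"candidates": [{"title": "Example Track", "score": 0.9}]}

def judge(state: RecState) -> RecState:
    w = state["plan"]["weights"]
    return {"explanation": f"Picked with discovery weight {w['discovery']:.0%}."}

graph = StateGraph(RecState)
graph.add_node("planner", planner)
graph.add_node("advocates", advocates)
graph.add_node("judge", judge)
graph.set_entry_point("planner")
graph.add_edge("planner", "advocates")
graph.add_edge("advocates", "judge")
graph.add_edge("judge", END)

app = graph.compile()
print(app.invoke({"query": "mellow late-night jazz"})["explanation"])
```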

Concept Visualizer: An AI-Powered Design Tool - Technical Deep Dive

Introduction

Purpose

This blog post documents the technical architecture and implementation of the Concept Visualizer, a web application designed to help users generate and refine visual concepts like logos and color palettes using AI. We'll explore the journey from an idea described in text to a set of visual assets, powered by a modern cloud-native stack.

Check out the project GitHub repository for the full code and detailed documentation. Here is the web application.

From Reddit to Insights: Building an AI-Powered Data Pipeline with Gemini (Cloud)

Introduction

Purpose

In this blog post, I document the process of building an AI-driven cloud data pipeline that automates the monitoring of AI-related subreddits. Using Google’s Gemini AI, the pipeline collects, processes, and synthesizes their discussions into structured daily reports. The system filters out irrelevant or harmful content, ensuring the extracted insights are both meaningful and actionable.
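As a rough illustration of the collect-and-synthesize core, here is a minimal sketch assuming Reddit access via PRAW and the `google-generativeai` SDK; the subreddit list, model name, and prompt wording are placeholders, not the pipeline's actual configuration.

```python
import os
import praw
import google.generativeai as genai

# Reddit client for collecting today's discussions.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="daily-ai-digest/0.1",
)

# Gemini client for synthesizing the daily report.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# Collect: top posts of the day from a few AI-related subreddits.
posts = []
for submission in reddit.subreddit("MachineLearning+artificial").top(time_filter="day", limit=25):
    posts.append(f"- {submission.title} (score {submission.score})\n{submission.selftext[:500]}")

# Synthesize: ask Gemini for a structured report, skipping off-topic or harmful posts.
prompt = (
    "Summarize the following Reddit discussions into a structured daily report "
    "with sections for key themes, notable releases, and open questions. "
    "Skip posts that are off-topic or harmful.\n\n" + "\n\n".join(posts)
)
print(model.generate_content(prompt).text)
```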

Check out the project GitHub repository for the full code and detailed documentation, and the Web Application.

From Reddit to Insights: Building an AI-Powered Data Pipeline with Gemini (On-Prem)

Introduction

Purpose

In this blog post, I document the process of building an AI-driven, on-premises data pipeline that automates the monitoring of AI-related subreddits. Using Google’s Gemini AI, the pipeline collects, processes, and synthesizes their discussions into structured daily reports. The system filters out irrelevant or harmful content, ensuring the extracted insights are both meaningful and actionable.
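Rather than repeat the collection sketch from the cloud variant above, here is a rough illustration of the other half, the relevance-and-safety gate: asking Gemini for a per-post KEEP/DROP verdict before anything enters the report. The prompt, labels, and model name are illustrative, not the pipeline's actual filtering logic.

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

def keep_post(title: str, body: str) -> bool:
    """Ask Gemini whether a post is AI-relevant and safe to include."""
    prompt = (
        "You are filtering Reddit posts for an AI news digest. "
        "Answer with exactly KEEP or DROP. KEEP only if the post is about AI "
        "and contains no harmful or abusive content.\n\n"
        f"Title: {title}\nBody: {body[:1000]}"
    )
    verdict = model.generate_content(prompt).text.strip().upper()
    return verdict.startswith("KEEP")

if keep_post("New open-weights LLM released", "Benchmarks and a paper link..."):
    print("post goes into today's report")
```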

Check out the project GitHub repository for the full code and detailed documentation, and the Web Application.