About the Project

mage-bench is a benchmark and observability stack for large language models playing real games of Magic: The Gathering inside a full XMage rules engine.

What mage-bench is

The project is a fork of XMage plus a harness that lets LLM agents pilot decks through structured tools. The models see the actual game state, choose legal actions, and the engine resolves the consequences under the same rules that govern a human game.
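The state-act-resolve loop described above can be sketched as follows. This is an illustrative toy, not mage-bench's actual API: the names (Game, visible_state, legal_actions, apply, run_match) and the trivial three-turn "engine" are all assumptions made for the example.

```python
# Hypothetical sketch of the harness loop. All names here are
# illustrative stand-ins, not mage-bench's real interfaces.
from dataclasses import dataclass


@dataclass
class Game:
    # Toy stand-in for the rules engine: a counter that ends after 3 turns.
    turn: int = 0

    def over(self) -> bool:
        return self.turn >= 3

    def visible_state(self, player: int) -> dict:
        # The agent only ever sees the state visible to its seat.
        return {"turn": self.turn, "player": player}

    def legal_actions(self) -> list[str]:
        return ["pass_priority", "play_land"]

    def apply(self, action: str) -> None:
        self.turn += 1


def run_match(game: Game, choose) -> int:
    """Drive the loop: show state, offer legal actions, apply the choice."""
    while not game.over():
        state = game.visible_state(player=0)
        actions = game.legal_actions()
        action = choose(state, actions)   # the model's decision point
        assert action in actions          # only legal actions get resolved
        game.apply(action)
    return game.turn
```

The key property the sketch illustrates is that the agent can only pick from the engine's legal-action list; the engine, not the model, resolves consequences.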

This is not a toy simulator or a simplified card battler. The point is to evaluate models against the full messiness of Magic: hidden information, stack interaction, combat math, priority, side effects, and long multi-turn planning.

What the project measures

mage-bench tracks match results across multiple formats, computes per-format and combined Elo ratings for 1v1 play, and publishes replays, logs, and derived stats to the website.
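For context, a standard Elo update for a single 1v1 match looks like the sketch below. The K-factor of 32 is a common default, not necessarily what mage-bench uses; the function name is illustrative.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update for one 1v1 match.

    score_a is 1.0 if player A won, 0.5 for a draw, 0.0 for a loss.
    Returns the updated (rating_a, rating_b) pair.
    """
    # Expected score for A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta


# Two equally rated models; A wins, so A gains k/2 = 16 points.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)  # → (1516.0, 1484.0)
```

A per-format rating just applies this update only to matches in that format, while the combined rating applies it to all matches.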

It also runs a separate blunder-analysis pass over finished games to estimate how often a model makes strategically bad choices, not just whether it won. That makes the project useful both as a leaderboard and as a debugging tool for agent behavior.
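The headline number from such a pass reduces to a simple ratio over per-decision annotations. The shape of the annotation data below is an assumption for illustration; the real pass's schema may differ.

```python
def blunder_rate(annotations: list[bool]) -> float:
    """Fraction of decisions flagged as blunders in one finished game.

    annotations: one boolean per decision, True if the analysis pass
    flagged that choice as strategically bad (hypothetical schema).
    """
    if not annotations:
        return 0.0  # no decisions analyzed, nothing to report
    return sum(annotations) / len(annotations)
```

Because the rate is computed per decision rather than per game, a model can win a match and still score poorly here, which is what makes it a complement to the Elo leaderboard.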

How it is used

Benchmarking

Compare frontier and open models on the same game engine, decks, and scoring rules.

Agent debugging

Inspect replays, tool traces, and blunder annotations to find failure modes in decision-making.

Tool design

Study which MCP and prompting patterns help models act reliably in a dense, stateful environment.

Where to go next

Start with the current season if you want results, games if you want replays, scoring if you want the evaluation rules, and internals if you want lower-level model and harness metrics.

The codebase and project history live on GitHub.