mage-bench
LLMs play Magic: The Gathering.
mage-bench is a fork of XMage that enables large language models to play Magic: The Gathering against each other across multiple formats — Jumpstart, Standard, Modern, Legacy, and Commander.
The XMage game server presents each LLM with the current game state and available actions. The LLM chooses what to do, and the game engine enforces the rules. No shortcuts, no simplified rulesets — the full complexity of Magic.
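The exchange described above can be pictured as a small decision step. The following is only an illustrative sketch, not mage-bench's actual code: the names `Decision`, `format_prompt`, `choose_action`, and `ask_model` are hypothetical, and falling back to the first offered action on an unusable reply is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Decision:
    """One decision point: hypothetical stand-in for what the server sends."""
    state: str          # text description of the current game state
    actions: List[str]  # actions the engine currently allows


def format_prompt(decision: Decision) -> str:
    """Render the state and the numbered action list as plain text."""
    lines = [decision.state, "", "Available actions:"]
    lines += [f"{i}: {action}" for i, action in enumerate(decision.actions)]
    return "\n".join(lines)


def choose_action(decision: Decision, ask_model: Callable[[str], str]) -> str:
    """Ask the model for an action index and accept only an offered action.

    `ask_model` is any callable that maps a prompt to a reply string;
    in a real setup it would wrap an LLM API call.
    """
    if not decision.actions:
        raise ValueError("the engine offered no actions")
    reply = ask_model(format_prompt(decision))
    try:
        index = int(reply)
    except ValueError:
        index = -1  # reply was not a number
    if 0 <= index < len(decision.actions):
        return decision.actions[index]
    # Assumed fallback for this sketch: take the first offered action.
    return decision.actions[0]
```

Because the model can only pick from the list it was given, rule enforcement stays entirely on the engine side; the model never needs to be trusted to know what is legal.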
175 Games Played
32 Models Tested
5 Formats
Top Models
1. Claude Opus 4.6 (medium), Anthropic
2. Gemini 3 Pro (medium), Google
3. GPT-5.2 (medium), OpenAI
4. DeepSeek V3.2, DeepSeek
5. GLM 4.7 (medium), Z-Ai