As a former competitive MtG player this is really exciting to me.

That said, I reviewed a few of the Legacy games (the format I'm most familiar with and also the hardest by far), and the level of play was so low that I don't think any of the results are valid. It's quite possible that in Legacy they would need some assistance playing Blue decks, but they don't seem to grasp even the most basic concepts, such as "Who's the beatdown?".

IMO the most important part of current competitive Magic is mulligans, and that's something an LLM should be extremely good at, but none of the games I'm seeing had either player starting with fewer than 7 cards... in my experience about 75% of games in Legacy have at least one player mulligan their opener.

Yeah, the intention here is not to answer "which deck is best" - the standard of play is nowhere near high enough for that. It's meant as more of a non-saturated benchmark for different LLM models, so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old". I'm optimistic that with continued improvements to the harness and new model releases we can get to at least "official Pro Tour stream commentator" skill levels within the next few years.

Hmm well, from my perspective, none of them are even really playing the game, they are just taking random actions. Any human, even a small child, would be much better.

And re: ages, it's worth noting that the youngest player to make Day 2 of a Grand Prix is 8 years old, and the youngest Pro Tour winner was 15 years old. I don't think it's realistic to get an LLM anywhere close to either of those players in skill level, though it's absolutely possible with a specialized model.

> so you can say things like "Grok plays as well as a 7-year-old, whereas Opus is a true frontier model and plays as well as a 9-year-old".

No, no, no... please think. Human child psychology is not the same as an LLM engine rating. Using that common comparison is both inaccurate and destructive to actual understanding. Asking politely: consider not saying that about LLM game ratings.

The model does not seem to be told what the cards actually do; it's just trying to guess.

Probably less of a problem for Legacy, but it isn't even told what format it's playing. So I expect it's easy to get the bots to do far better than this by giving them more information.
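
To make "giving them more information" concrete, here is a rough sketch of what enriching the prompt could look like. It assumes the payload shape shown in the logged mulligan prompt (`your_hand`, `name`); the `format` and `rules_text` fields, the `enrich_payload` helper, and the placeholder strings are all made up for illustration and are not something the harness is known to support.

```python
import json

def enrich_payload(payload, oracle_text, game_format):
    """Return a copy of the prompt payload with the format name and
    per-card rules text attached (both fields are hypothetical)."""
    enriched = dict(payload)  # shallow copy so the original is not modified
    enriched["format"] = game_format
    enriched["your_hand"] = [
        # Fall back to a marker when we have no text for a card.
        {**card, "rules_text": oracle_text.get(card["name"], "(unknown)")}
        for card in payload.get("your_hand", [])
    ]
    return enriched

# Trimmed-down version of the logged payload.
payload = {
    "message": "Mulligan down to 6 cards?",
    "your_hand": [
        {"name": "Steam Vents", "is_land": True},
        {"mana_cost": "{U}", "name": "Gran-Gran", "power": "1", "toughness": "2"},
    ],
}

# Placeholder strings only; real rules text would come from a card database.
oracle_text = {"Gran-Gran": "<rules text for Gran-Gran goes here>"}

enriched = enrich_payload(payload, oracle_text, "<format name goes here>")
print(json.dumps(enriched, indent=2))
```

Whether this actually improves play is an open question, but it would at least remove the guessing.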

Let me analyze this opening hand:

*Hand analysis:*
- 3 lands (Riverpyre Verge, Steam Vents, Willowrush Verge) - dual lands which is good
- Monument to Endurance ({3}) - a 3-drop
- Accumulate Wisdom ({1}{U}) - 2-drop, likely a blue advantage engine
- Boomerang Basics ({U}) - 1-drop blue spell
- Gran-Gran ({U}) - 1-drop blue creature (1/2)

This is actually a reasonable hand. I have:
- 3 lands giving me mana access (with color fixing)
- A 1-drop (Gran-Gran or Boomerang Basics to choose from)
- A 2-drop (Accumulate Wisdom)
- A 3-drop (Monument to Endurance)

The curve is solid. I have some interactive blue play with Boomerang Basics and early board presence with Gran-Gran. The mana base supports blue well with all three lands. This is keep-able — while not spectacular, it's a functional opening hand.

{ "hand_size": 7, "action_type": "GAME_ASK", "players": "GPT52 Vet(20hp), Haiku Worrier(you,20hp)", "land_count": 3, "context": "T1 ()", "response_type": "boolean", "respond_with": "answer=true or answer=false", "message": "Mulligan down to 6 cards?", "your_hand": [ { "name": "Riverpyre Verge", "is_land": true }, { "name": "Steam Vents", "is_land": true }, { "name": "Willowrush Verge", "is_land": true }, { "mana_cost": "{3}", "name": "Monument to Endurance" }, { "mana_cost": "{1}{U}", "name": "Accumulate Wisdom" }, { "mana_cost": "{U}", "name": "Boomerang Basics" }, { "mana_cost": "{U}", "name": "Gran-Gran", "power": "1", "toughness": "2" } ], "action_pending": true }
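
For comparison, a payload like this can be answered by a trivial non-LLM baseline. The sketch below assumes the fields shown in the log (`your_hand`, `is_land`) and the `answer=true` / `answer=false` reply format from `respond_with`. The keep rule (keep with 2 to 5 lands) is a rough rule of thumb I picked for illustration, and `naive_mulligan_answer` is a made-up name, not part of the harness.

```python
import json

def naive_mulligan_answer(raw_payload, min_lands=2, max_lands=5):
    """Answer a 'Mulligan down to N cards?' prompt using land count only.

    The thresholds are an illustrative rule of thumb, not tuned values.
    """
    payload = json.loads(raw_payload)
    lands = sum(1 for card in payload["your_hand"] if card.get("is_land"))
    mulligan = not (min_lands <= lands <= max_lands)
    return "answer=true" if mulligan else "answer=false"

# The hand from the log above, reduced to the fields the baseline reads.
raw = json.dumps({
    "message": "Mulligan down to 6 cards?",
    "your_hand": [
        {"name": "Riverpyre Verge", "is_land": True},
        {"name": "Steam Vents", "is_land": True},
        {"name": "Willowrush Verge", "is_land": True},
        {"name": "Monument to Endurance", "mana_cost": "{3}"},
        {"name": "Accumulate Wisdom", "mana_cost": "{1}{U}"},
        {"name": "Boomerang Basics", "mana_cost": "{U}"},
        {"name": "Gran-Gran", "mana_cost": "{U}"},
    ],
})

print(naive_mulligan_answer(raw))
```

With three lands in hand this baseline keeps, the same call the model made. A baseline like this gives a floor to compare mulligan decisions against.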

I picked a random Commander game and the first message one of the four players left was "Well I should be getting a new hand soon because I have asked for a mulligan". It's definitely in there; whether they are using it correctly is another question.