I was thinking you could formally benchmark decks against each other en masse. MTG is not my wheelhouse, but in YGO at least, deck power is gauged by frequency of use and placement at official tournaments. Imagine taking any combination of cards, including undiscovered/untested ones, and simulating a vast number of games in parallel.

Of course, once you quantify deck quality to that degree, I'd argue it's not fun anymore. YGO is already not fun anymore because of this rampant quantification, and it didn't even take LLMs to arrive there.

Why would you use LLMs at all for that? Can't you just Monte Carlo this thing and be done with it?

You still need an algorithm to decide, for each game you're simulating, what plays actually get made. If that algorithm is dumb, you might conclude Mono-Red Burn is the best deck, not because it is the best deck, but because the dumb algorithm can pilot Burn much better than it can pilot Storm, inflating Burn's win rate.
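To make the confound concrete, here's a toy Python sketch. Everything in it is made up: the deck "strength"/"difficulty" numbers and the play_game stand-in are placeholders for a real rules engine and real decision policies. The point is that what the harness measures is "deck as piloted by this policy", not "deck" in isolation:

```python
import random
from itertools import combinations

# Hypothetical sketch of a Monte Carlo matchup harness. In a real system
# play_game() would drive an actual rules engine and the policies would make
# real in-game decisions; here the "pilot skill" is just a number.

# Toy "decks": (intrinsic deck strength, how hard the deck is to pilot).
DECKS = {
    "Mono-Red Burn": (0.55, 0.1),   # decent, and very easy to pilot
    "Storm":         (0.70, 0.8),   # stronger, but needs a good pilot
    "Midrange":      (0.60, 0.4),
}

def play_game(deck_a, deck_b, skill_a, skill_b):
    """Placeholder game: win chance = deck strength, discounted by however
    much of the deck's difficulty the pilot's skill fails to cover."""
    str_a, diff_a = DECKS[deck_a]
    str_b, diff_b = DECKS[deck_b]
    eff_a = str_a * (1 - max(0.0, diff_a - skill_a))
    eff_b = str_b * (1 - max(0.0, diff_b - skill_b))
    return random.random() < eff_a / (eff_a + eff_b)  # True = deck A wins

def win_rates(skill, n_games=20_000):
    """Estimate pairwise win rates when every deck is piloted at `skill`."""
    rates = {}
    for a, b in combinations(DECKS, 2):
        wins = sum(play_game(a, b, skill, skill) for _ in range(n_games))
        rates[(a, b)] = wins / n_games
    return rates

if __name__ == "__main__":
    print("weak pilot:  ", win_rates(skill=0.2))
    print("strong pilot:", win_rates(skill=0.9))
```

With the weak pilot the harness crowns Burn; crank the skill knob and Storm pulls ahead, even though nothing about the decks changed.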

In principle, LLMs could have a much higher strategy ceiling than deterministic decision-tree-style AIs. But my experience with mage-bench is that LLMs are probably not good enough to outperform even very basic decision-tree AIs today.

Um, obviously the Monte Carlo results would be used to generate utility AI scoring functions that determine the best card to play under different considerations. Have the people building these LLM AI systems even had experience with classical game AI!? This is a solved problem; the LLM solution is slow, expensive, and energy-inefficient.
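For the skeptical, a minimal sketch of what that looks like (hypothetical features, weights, and card names; in practice you'd fit the weights against the Monte Carlo win rates rather than eyeballing them):

```python
from dataclasses import dataclass

# Hypothetical sketch of a classical utility AI: each candidate play is
# scored as a weighted sum of game-state features, and the weights are the
# part you'd tune offline against simulation results.

@dataclass
class Play:
    name: str
    damage: float          # expected damage this turn
    card_advantage: float  # net cards gained
    mana_left: float       # mana still available after the play

# Weights per consideration -- fit these from the Monte Carlo data.
WEIGHTS = {"damage": 1.0, "card_advantage": 2.5, "mana_left": 0.3}

def utility(play: Play) -> float:
    return (WEIGHTS["damage"] * play.damage
            + WEIGHTS["card_advantage"] * play.card_advantage
            + WEIGHTS["mana_left"] * play.mana_left)

def best_play(candidates: list[Play]) -> Play:
    return max(candidates, key=utility)

candidates = [
    Play("Lightning Bolt to face", damage=3, card_advantage=-1, mana_left=2),
    Play("Divination",             damage=0, card_advantage=1,  mana_left=0),
]
print(best_play(candidates).name)
```

Scoring a play costs a handful of float multiplications, so you can run it millions of times inside the simulator without thinking about token budgets.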

Worse, the LLM approach is difficult to tweak. For example, what if you want AIs that play at varying difficulty levels? Are you just gonna prompt the LLM "hey, try to be kinda shitty at this but still somewhat good"?
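With a utility-scoring AI, difficulty is literally one number, e.g. a softmax temperature over the candidate scores (again a hypothetical sketch, not from any particular engine):

```python
import math
import random

# Hypothetical sketch: difficulty as a knob on a classical scoring AI.
# Instead of prompting "play kinda badly", sample among candidate plays with
# a softmax over their utility scores; temperature controls the sloppiness.

def pick_play(scored_plays, temperature):
    """scored_plays: list of (name, utility score).
    temperature near 0 -> near-optimal; higher -> increasingly random."""
    weights = [math.exp(score / temperature) for _, score in scored_plays]
    total = sum(weights)
    r = random.random() * total
    for (name, _), w in zip(scored_plays, weights):
        r -= w
        if r <= 0:
            return name
    return scored_plays[-1][0]

plays = [("best line", 5.0), ("okay line", 3.0), ("bad line", 1.0)]
print("hard bot:", pick_play(plays, temperature=0.2))  # almost always best
print("easy bot:", pick_play(plays, temperature=3.0))  # misplays regularly
```

Low temperature plays near-optimally; turn it up and the bot starts misplaying in a controllable, reproducible way.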