Hi, BalatroBench creator here. Yeah, Google models perform well (I'd guess it's the long-context plus world-knowledge combination). Opus 4.6 looks good in preliminary results (on par with Gemini 3 Pro). I'll add more models and report back soon. Tbh, I didn't expect LLMs to start winning runs; I guess I'll have to move to harder stakes (e.g. red stake).
Thank you for the site! I've got a few suggestions:
1. I think win rate is more telling than the average round reached (rough sketch of the difference at the bottom of this comment).
2. Some runs are bugged (like Gemini's run 9) and should be excluded from the results. Selling the Invisible Joker is always bugged, which invalidates every run on seed EEEEEE.
3. Instead of giving them "strategy" like "flush is the easiest hand...", it would be fairer to clarify mechanics that confuse human players too, e.g. "played" vs. "scored".
In particular, I think this kind of prompt gives the LLMs an unfair advantage and can skew the results:
> ### Antes 1-3: Foundation
> - *Priority*: One of your primary goals for this section of the game should be obtaining a solid Chips or Mult joker
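To illustrate point 1, here's a minimal sketch of how the two metrics can diverge, assuming each run is logged with a win flag and the furthest round reached (field names, model labels, and numbers are made up for illustration; I don't know how the runs are actually stored):

```python
from statistics import mean

# Hypothetical run log: field names and values are invented for illustration.
runs = [
    {"model": "model-a", "won": True,  "round_reached": 24},
    {"model": "model-a", "won": False, "round_reached": 14},
    {"model": "model-b", "won": False, "round_reached": 23},
    {"model": "model-b", "won": False, "round_reached": 22},
]

def summarize(model: str) -> tuple[float, float]:
    """Win rate and mean round reached for one model's runs."""
    rs = [r for r in runs if r["model"] == model]
    win_rate = sum(r["won"] for r in rs) / len(rs)
    return win_rate, mean(r["round_reached"] for r in rs)

for m in ("model-a", "model-b"):
    wr, avg = summarize(m)
    print(f"{m}: win rate {wr:.0%}, avg round {avg:.1f}")

# model-a wins half its runs despite a lower average round,
# which is exactly the kind of difference the average hides.
```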