To test this, you would just need to edit the ROM and switch around the solution. Not sure how complicated that is; it likely depends on the ROM system.
I don't know why people still get wrapped around the axle of "training data".
Basically every benchmark worth its salt uses bespoke problems purposely tuned to force the models to reason and generalize. That's the whole point of the ARC-AGI tests.
Unsurprisingly, Gemini 3 Pro performs way better on ARC-AGI than 2.5 Pro, and unsurprisingly it did much better in Pokémon.
The benchmarks, by design, indicate you can mix up the switch puzzle pattern and the model will still solve it.