In my project I rigged up an in-browser emulator and fed captured screen images directly to local multimodal models.

So it just looks right at what's going on, writes a description for refinement, and uses all of that to create and manage goals, write to a scratchpad, and submit input. The scaffolding is minimal because I wanted to see what these raw models are capable of. Kind of a benchmark.
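The loop above could be sketched roughly like this. Note that `captureScreen`, `runModel`, and `pressButton` are hypothetical stubs standing in for the real emulator and in-browser model calls, and the state shape is my guess at what "goals plus a scratchpad" might look like, not the project's actual code:

```typescript
// Sketch of the observe -> describe -> plan -> act loop described above.
// All three functions below are hypothetical stand-ins.

type AgentState = {
  goals: string[];      // goal stack, created and managed by the model
  scratchpad: string;   // free-form notes the model writes to itself
};

// Stub: in a real setup this would grab a frame from the emulator canvas.
function captureScreen(): string {
  return "<base64 frame>";
}

// Stub: in a real setup this would call a small local multimodal model
// with the current frame plus the agent's goals and scratchpad.
function runModel(
  prompt: string,
  image: string
): { description: string; goal: string; note: string; button: string } {
  return {
    description: "title screen",
    goal: "start game",
    note: "press START",
    button: "START",
  };
}

// Stub: forwards a gamepad input to the emulator.
function pressButton(button: string): void {}

function step(state: AgentState): AgentState {
  const frame = captureScreen();
  // The model looks at the frame, writes a description for refinement,
  // updates its goals and scratchpad, and picks an input to submit.
  const out = runModel(
    `Goals: ${state.goals.join("; ")}\nScratchpad: ${state.scratchpad}`,
    frame
  );
  pressButton(out.button);
  return {
    goals: [out.goal, ...state.goals].slice(0, 5), // keep a short goal stack
    scratchpad: out.note,
  };
}
```

The point being that the whole control loop fits in one small function; everything interesting happens inside the model call.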

I have a feeling that if you gave them access to GameFAQs guides they might play better, but it depends on how you feed them the data.

It turns out that cutting-edge, very small models (3B params or so) that fit in the browser are not great at playing Pokémon even at a basic level. Navigation is difficult when providing only raw visual information, and recognition of the low-resolution sprites is poor. So I lost interest before even getting to the point of providing specific strategy.

But it runs in the browser and works with any supplied ROM; none of it is Pokémon-specific, so I should set aside time to host it and make the code available.

You should consider publishing your setup as either a blog post or a GitHub repo. I think it could make for good benchmarking of smaller models. Ideally we can one day all run small models that can do amazing things.