I don't quite get how diffing frames allows you to find the scores.

TFA mentions comparing a frame with and without the UI - but how do you generate the frame without it? If you can already do that, what's the point of the comparison?

He's diffing two frames, and the only pixels that stay the same are the UI. The diff itself doesn't give him a readable UI (see the example, it's illegible), but he can extract the POSITION of the UI on the screen by finding all the non-red pixels.

And then he does a good ol' regular crop on the original image to get the UI excerpt to feed the vision model.
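
For anyone still puzzled, here's a rough sketch of that pipeline in OpenCV. The filenames, the threshold value, and the "largest static blob" heuristic are all my assumptions, not necessarily what the article does:

```python
import cv2

# Any two frames taken a few seconds apart (hypothetical filenames).
frame_a = cv2.imread("frame_0100.png")
frame_b = cv2.imread("frame_0200.png")

# Per-pixel absolute difference: moving gameplay pixels differ,
# while the static scoreboard overlay stays (nearly) identical.
diff = cv2.absdiff(frame_a, frame_b)
gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)

# Pixels that barely changed are candidates for the static UI.
# The threshold of 10 is a guess to absorb compression noise.
_, static_mask = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY_INV)

# Assume the largest static region is the scoreboard and take
# its bounding box as the crop rectangle.
contours, _ = cv2.findContours(static_mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))

# Crop the ORIGINAL frame (not the diff) so the text stays legible,
# then hand the excerpt to the vision model / OCR step.
ui_crop = frame_a[y:y + h, x:x + w]
cv2.imwrite("scoreboard_crop.png", ui_crop)
```

The key point is that the diff is only used to *locate* the UI; the legible pixels come from cropping an untouched frame.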

I think the text is wrong: it's diffing two frames, and the areas that stay the same are where the scoreboard is, since the scoreboard doesn't change between frames while everything else does.

I was also confused by this. I think you're right, but the original text specifically mentions a 'static background' that they remove, so it's not just a simple 'wrong way round' error; it's a fundamental misunderstanding of what's happening. Makes me wonder if the author actually knew what they were doing, or was just using an LLM to vibe-code everything.