Seems trivial to diff multiple screenshots to identify which parts move. Or just use a compression algorithm to do the same.

Would 2 screenshots be enough, I wonder?

Yeah, the letters are big enough that an XOR of two screenshots shows the text quite clearly.
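
For anyone who wants to try it, here is a rough sketch of the XOR idea in plain Python. It is not the exact thing I ran: the frames are synthetic 1-bit noise, and the "moving" region (a crude letter T), the frame size, and the helper names `xor_frames` and `render` are all made up for illustration. It also assumes every pixel in the moving region changes between the two frames; with real noise only some of them would, so the letter would look speckled rather than solid.

```python
import random


def xor_frames(frame_a, frame_b):
    """Pixel-wise XOR of two equally sized frames (lists of rows of ints)."""
    if len(frame_a) != len(frame_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(frame_a, frame_b)
    ):
        raise ValueError("frames must have the same dimensions")
    return [
        [a ^ b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def render(diff):
    """Draw changed pixels as '#' and unchanged pixels as '.'."""
    return "\n".join("".join("#" if px else "." for px in row) for row in diff)


# Synthetic stand-in for two screenshots (assumption: 1-bit noise frames).
rng = random.Random(0)
HEIGHT, WIDTH = 5, 12

# Hypothetical moving region as (row, column) pairs: a crude letter "T".
moving = {(0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (1, 5), (2, 5), (3, 5), (4, 5)}

frame_1 = [[rng.randint(0, 1) for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Second frame: identical background, but every pixel in the moving region
# is flipped (a simplification, see the note above).
frame_2 = [
    [px ^ 1 if (y, x) in moving else px for x, px in enumerate(row)]
    for y, row in enumerate(frame_1)
]

diff = xor_frames(frame_1, frame_2)
print(render(diff))
```

Each frame on its own is just noise, but the XOR is nonzero only where something changed, so the static background cancels out and the letter is what remains. With real screenshots you would load the pixels with an image library and XOR them the same way.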