This is actually a very valid technique. We do the same (as an RL environments provider).

Except we bundle it with a custom browser renderer that generates rewards from DOM diffs rather than from screenshots.
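
To give a rough idea of what a DOM-diff reward can look like, here is a minimal sketch in plain Python. It is not the code from our renderer: `flatten`, `dom_diff`, `reward`, and the `(path, kind, value)` record format are all hypothetical names made up for this example, and it assumes you can get the page HTML before and after the agent's action.

```python
from collections import Counter
from html.parser import HTMLParser


class _Flattener(HTMLParser):
    """Flatten HTML into a multiset of (path, kind, value) records (illustrative format)."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.records = Counter()

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.records[(path, "node", "")] += 1
        for name, value in attrs:
            self.records[(path, "attr:" + name, value or "")] += 1
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records[("/".join(self.stack), "text", text)] += 1


def flatten(html):
    parser = _Flattener()
    parser.feed(html)
    parser.close()
    return parser.records


def dom_diff(before, after):
    """Return (added, removed) records; Counter subtraction keeps only positive counts."""
    b, a = flatten(before), flatten(after)
    return a - b, b - a


def reward(before, after, expected_added):
    """Fraction of the expected records that actually show up in the diff."""
    if not expected_added:
        return 0.0
    added, _ = dom_diff(before, after)
    hits = sum(1 for rec in expected_added if added[rec] > 0)
    return hits / len(expected_added)


before = "<ul><li>milk</li></ul>"
after = "<ul><li>milk</li><li>eggs</li></ul>"
print(reward(before, after, [("ul/li", "text", "eggs")]))
```

The point of diffing structure instead of pixels is that the reward only depends on what changed in the page, not on fonts, viewport size, or rendering noise. A real setup would likely diff the live DOM rather than serialized HTML.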

The browser renderer is open source: https://github.com/wootzapp/wootz-browser