interesting, which model were you using for the vision part? In my experience, Claude Sonnet and Opus handle UI screenshots reasonably well: not perfect, but good enough that the agent can catch obvious layout issues and iterate. Definitely not at the "pixel-perfect design implementation" stage yet, though, but for testing features it's fine. In that case the goal is for the agent to verify that the UX/UI flow works, not that every pixel is perfectly aligned.
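For anyone curious what the screenshot-review step can look like, here's a minimal sketch of building an Anthropic Messages API request with a screenshot attached. The helper name, model string, and prompt are all illustrative (the actual agent setup isn't shown in this thread); it just constructs the payload, it doesn't send it.

```python
import base64

def build_screenshot_review_request(png_bytes: bytes, instructions: str) -> dict:
    # Hypothetical helper: packages a UI screenshot plus review instructions
    # into the content format the Messages API expects for images.
    return {
        "model": "claude-sonnet-4-20250514",  # assumption: any vision-capable model works here
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(png_bytes).decode()}},
                {"type": "text", "text": instructions},
            ],
        }],
    }

payload = build_screenshot_review_request(
    b"\x89PNG...",  # placeholder bytes; a real screenshot would go here
    "Check this page for obvious layout issues: overlapping elements, "
    "clipped text, broken alignment. Reply with a list of problems or 'OK'.",
)
print(payload["messages"][0]["content"][0]["type"])
```

The agent loop is then: render the page, send the screenshot with a prompt like the one above, and iterate on whatever issues come back, which matches the "flow works, not pixel-perfect" bar.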