The interesting part to me is recovery after the first generated script goes stale. I’d be curious whether you measure success as 'initial generation works' or 'the same flow still passes after small DOM/layout changes a week later', since that seems like the boundary between a neat demo and something a team can rely on.

And even beyond a week. What happens when the site owner redesigns the interface and breaks things? What is the recovery process and how “deep” does the rework have to go? Would you expect to use the same prompt and just have the AI figure it out again and then ship the new code?

Our philosophy when building automation scripts is to use AI in a limited, intentional way. From our experience writing RPA scripts, these super old portals rarely change their websites. Using internal network requests is even more reliable, because those change less often still.
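To make the network-request approach concrete, here's a minimal sketch. Everything in it is illustrative: the endpoint path `/api/claims/search`, the payload fields, and the host are hypothetical stand-ins for whatever you'd observe the portal's own frontend sending in the browser's network tab.

```python
import json

# Hypothetical: the request the portal's own frontend sends to its internal
# search endpoint, as observed in the browser's network tab. We replay this
# directly instead of driving the DOM, so UI redesigns don't break the flow.
def build_claim_search_request(member_id: str, date_of_service: str) -> dict:
    """Build the raw request to replay against the portal's internal API."""
    return {
        "url": "https://portal.example.com/api/claims/search",  # illustrative
        "method": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"memberId": member_id, "dos": date_of_service}),
    }

req = build_claim_search_request("A123456", "2024-01-15")
print(req["method"], req["url"])
```

The point is that this request shape is coupled to the backend contract, not to selectors or layout, so a visual refactor of the portal usually leaves it untouched.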

Our healthcare workflows are sensitive so if there happens to be some website refactor or something, we want to be kept in the loop and not have AI just try and figure it out.

At the end of the day this all lives as code though, so if you needed to use the DOM and wanted a workflow that naturally handles any DOM change without escalating to you, you could just throw a computer-use agent at it.