The stale state problem is real and underappreciated. I've been running browser automation through OpenClaw and the failure modes you describe — modal appears after screenshot, dropdown covers the target element — are exactly what causes silent failures that are hard to debug. The agent "succeeds" from its perspective because it acted on the last known state.

The freeze-then-capture approach is interesting. Curious how it handles pages with aggressive anti-bot detection that fingerprints headless Chromium forks — that's the other failure mode I keep hitting.

Right now, it's evading all anti-botting detectors I've tested it on. I believe it's due to the fact it runs in headful mode and I've removed all detectable CDP signatures. Input events are also simulated at a system level (typing is at 200 WPM) so it's very hard for a page's javascript to detect it's not in a human operated chrome. A lot of detection on headless happens due to the webGPU capabilities being disabled since a modern computer is very unlikely to not support those. You could also wire up one of the Heretic models as a dedicated Captcha solver, I recommend Qwen 3.5 27b Heretic! https://huggingface.co/coder3101/Qwen3.5-27B-heretic