The problem isn’t that AI cannot do this already. Because it can. The problem is cost.
There are people complaining that MCP costs too many tokens. Having AI parse the same UIs that humans use would cost far more, which would just be insane.
And to be honest, I don’t think that’s the right goal for AI anyway. Most competent engineers prefer APIs to human interfaces. So why should machine-to-machine communication go through human interfaces?