This sounds very plausible. Arguably MCP servers are already a step in that direction: they give LLMs a text-based way to use services that is easy for them. Agents that look at your screen and click on menus are cool, but they are a clumsy and very expensive intermediate step.

When I use Telegram to talk to the OpenClaw instance on my spare Mac, I am already choosing a new interface over whatever the designers of the apps it uses built. Why keep the human-facing version as is? Why not build an agent-first interface (one that does not involve having to "see" windows), plus a validation interface for the human minder?