So you can either wait for every application to do that, or at least make it possible for an LLM to do it… or you can make the LLM use a computer interface that works with every application by definition.

The middle ground would be leveraging e.g. standard a11y APIs, and/or hooking into applications the way Squish does.

Then you get a nice textual world that fits the LLM without having to rewrite every application to have a full-blown HTTP server.
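
To make the "textual world" idea a bit more concrete, here's a rough sketch in Python. `Node` is a made-up stand-in for whatever element type a platform a11y API would hand you, and `render` is a hypothetical helper; nothing here calls a real accessibility API, it only shows how a UI tree could be flattened into text an LLM can read.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one element of an accessibility tree."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> str:
    """Flatten the tree into indented text, one UI element per line."""
    label = f'{node.role} "{node.name}"' if node.name else node.role
    lines = ["  " * depth + label]
    for child in node.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)


# Toy tree; a real one would be read from the running application.
window = Node("window", "Editor", [
    Node("menubar", "", [Node("menu", "File"), Node("menu", "Edit")]),
    Node("textbox", "Document"),
    Node("button", "Save"),
])

print(render(window))
```

The point is only that the application doesn't need to change: the tree already exists for screen readers, and the LLM side just needs a text rendering of it plus some way to send actions back.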