I've been saying this is coming for a long time, but my really smart SWE friend, who is nevertheless not in the AI/ML space, dismissed it as a stupid, roundabout way of doing things: software should just talk via APIs. No matter how much I argued about legacy software/websites and how much functionality is really only available through GUI, it seems some people are really put off by this type of approach. To me, being more embedded in the AI, computer vision, and robotics world, the fuzziness of day-to-day life is more apparent.

Just as how expert systems didn't take off and tagging every website for the Semantic Web didn't happen either, we have to accept that the real world of humans is messy and unstructured.

I still advocate making new things more structured. A car on wheels on flattened ground will always be more efficient than skipping the landscaping part and just riding quadruped robots through the forest on uneven terrain. We should develop better information infrastructure but the long tail of existing use cases will require automation that can deal with unstructured mess too.

>it seems some people are really put off by this type of approach

As someone who has had to interact with legacy enterprise systems via RPA (screen scraping and keystroke recording) it is absolutely awful, incredibly brittle, and unmaintainable once you get past a certain level of complexity. Even when it works, performance at scale is terrible.

Every time I imagine building this, I imagine the "it works" happy path, and that I'll get bit by a deluge of random error messages I never accounted for.

Adding a neural network in the middle suddenly makes these things less brittle. We are approaching the point where this kind of "hacky glue" is almost scalable.

It's Postel's Law, on steroids. Be liberal in what you accept (with LLMs, that means 'literally anything'), but strict in what you return (which in an LLM is still 'literally anything' but you can constrain that).
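The "liberal in, strict out" shape can be sketched in a few lines. This is a hypothetical helper, not any real library's API: it salvages a JSON object out of whatever chatter the model wraps around it, then enforces a fixed schema before anything downstream sees it.

```python
import json

def parse_strict(raw: str, required: dict[str, type]) -> dict:
    """Be liberal in what we accept from the model, strict in what we return.

    `raw` may be wrapped in markdown fences or prose; we salvage the first
    JSON object, then enforce a fixed schema and drop everything else.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    obj = json.loads(raw[start:end + 1])
    for key, typ in required.items():
        if not isinstance(obj.get(key), typ):
            raise ValueError(f"field {key!r} missing or not {typ.__name__}")
    return {k: obj[k] for k in required}  # strict: only the declared fields

messy = 'Sure! Here is the data:\n```json\n{"city": "Oslo", "temp_c": 7, "note": "x"}\n```'
print(parse_strict(messy, {"city": str, "temp_c": int}))  # → {'city': 'Oslo', 'temp_c': 7}
```

The interesting design choice is the failure mode: raising on a schema miss gives you a natural retry loop (re-prompt the model), which is how the "constrain that" part usually gets implemented in practice.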

Beyond that, I can't help but think of the old thin vs. thick client debate, and I would argue that "software should just talk via APIs" is why, in the web space, everybody is blowing time and energy on building client/server architectures and SPAs instead of basic-ass full-stacks.

It's basically the digital equivalent of humanoid robots - people object because having computers interact with a browser, like building a robot in the form of a human, is incredibly inefficient in theory or if you're designing a system from scratch.

The problem is that we're not starting from scratch - we have a web engineered for browser use and a world engineered for humanoid use. That means an agent that can use a browser, while less efficient than an agent using APIs at any particular task, is vastly more useful because it can complete a much greater breadth of tasks. Same thing with humanoid robots - not as efficient at cleaning the floor as my purpose-built Roomba, but vastly more useful because the breadth of tasks it can accomplish means it can be doing productive things most of the time, as opposed to my Roomba, which is not in use 99% of the time.

I do think that once AI agents become common, the web will increasingly be designed for their use and will move away from the browser, but that will probably take a comparable amount of time as it did for the mobile web to emerge after the iPhone came out. (Actually that's probably not true - it'll take less time because AI will be doing the work instead of humans.)

Yes, but my friend would say, all these websites/software should just publish an API and if they don't that's just incompetence/laziness/stupidity. But a "should" doesn't matter. Changing human nature is so immensely difficult, but it feels easy to say "everyone should just [...]". Seems to be a gap in thinking that's hard to bridge.

We took this approach at Industry Dive already because of these reasons. diveaccess.com

Totally agree. A general-purpose solution that ties together different messy interfaces will win in the long run -- e.g. the IP protocol, copy-paste, browsers. In each case, they provide a single way for different parts of computing to collaborate. As mentioned before, Semantic Web initiatives did not succeed, and I think there's an important lesson there.

I recall 90's Macs had a 3rd party app that offered to observe your mouse/keyboard and then automatically recommend routine tasks for you. As a young person I found that fascinating. It's interesting to see history repeat itself.

If you want an API, have Claude procedurally test actions and then write a pyautogui/pywinauto/autohotkey etc script to perform it instead. Have it auto-test to verify and classify the general applicability of each action. Repeat for all of life...
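A rough sketch of the "write the script" half of that idea, assuming an agent has already recorded and verified a sequence of actions (the action-tuple format and function name here are made up for illustration): turn the verified steps into a standalone pyautogui script, so the expensive model run happens once and the cheap script runs forever after.

```python
def emit_pyautogui_script(actions):
    """Turn a list of verified UI actions into standalone pyautogui source.

    `actions` is a list of tuples like ("click", x, y) or ("type", text) --
    the sort of trace an agent could record while procedurally testing a GUI.
    A real harness would also emit the verification checks alongside each step.
    """
    lines = ["import pyautogui",
             "pyautogui.PAUSE = 0.5  # settle time between steps"]
    for action in actions:
        if action[0] == "click":
            _, x, y = action
            lines.append(f"pyautogui.click({x}, {y})")
        elif action[0] == "type":
            lines.append(f"pyautogui.write({action[1]!r})")
        else:
            raise ValueError(f"unknown action: {action[0]}")
    return "\n".join(lines)

script = emit_pyautogui_script([("click", 120, 340), ("type", "hello")])
print(script)
```

The fragility problem from upthread doesn't disappear, of course: hard-coded coordinates break on the next UI change, which is exactly when you'd fall back to the agent to re-derive and re-verify the script.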


> and how much functionality is really only available through GUI

Isn't the GUI driven by code? Can anything at all in the GUI work that can't be done programmatically?

The code behind the GUI can be arbitrarily obscure. The only reliable way to understand its meaning in the general case is to run it and look at the rendered image. Trying to build a model that implicitly develops an alternative implementation of a browser inside of it sounds worse than just using an existing browser directly.

More often than not you don't have access to the underlying code, or the vendor has interest against you being able to automate it since the complexity is a part of their business model.