Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks.
Nice, but I wonder why they didn't use UI automation/accessibility libraries, which have access to the semantic structure of apps/web pages, or access documents directly instead of having Excel display them for you.
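For comparison, a minimal sketch of the direct-access route, assuming a local report.xlsx and the openpyxl library (both the file and its layout are illustrative): the data comes back as structured values, with no Excel window or pixel parsing involved.

    # Illustrative: read spreadsheet cells directly instead of having
    # Excel render them. File name and sheet layout are made up.
    from openpyxl import load_workbook

    wb = load_workbook("report.xlsx", read_only=True)
    ws = wb.active

    # Cell values arrive as Python objects, not pixels on a screen.
    for row in ws.iter_rows(min_row=1, max_row=5, values_only=True):
        print(row)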
We use operating system accessibility APIs when available in https://github.com/OpenAdaptAI/OpenAdapt.
I wonder if the model has difficulties for the same reason some people do: UI affordances have declined with the flattening, hover-to-see-scrollbar, hamburger-menu-ization of UIs.
I'd like to see a model trained on a Windows 95/NT-style UI. Would it have an easier time when every UI element has clearly defined edges, clearly defined clickability and draggability, a unified design language, etc.?
What the UI looks like has no effect on, for example, Windows UI Automation libraries. The tech works by querying the process directly for the semantic description of its items: here's a button called 'Delete', here's a list of TODO items, and you get the tree structure straight from the API.
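A minimal sketch of that query path, assuming Windows, an already-running Notepad instance, and the pywinauto library's UIA backend (the window title is an assumption):

    # Illustrative: walk the semantic tree the process exposes via
    # UI Automation -- control types and names, no screenshots.
    from pywinauto import Application

    app = Application(backend="uia").connect(title="Untitled - Notepad")
    win = app.window(title="Untitled - Notepad")

    for ctrl in win.descendants():
        info = ctrl.element_info
        print(info.control_type, repr(info.name))

Each element comes back with a role and a name, which is exactly the "here's a button called 'Delete'" structure described above.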
I wouldn't be surprised if, even though they are working off of screenshots, they still trained their models on screenshots annotated by those automation libraries, which told the AI which pixels correspond to which elements.
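If that's the pipeline, a toy version might look like the sketch below: capture a window screenshot, then overlay the bounding rectangles the accessibility API reports. This is purely speculative; pywinauto/PIL and the Notepad target are my assumptions, not anything Anthropic has described.

    # Speculative sketch: annotate a window screenshot with the bounding
    # boxes the accessibility tree reports -- the kind of pixel-to-element
    # labeling a vision model could be trained on.
    from pywinauto import Application
    from PIL import ImageDraw

    app = Application(backend="uia").connect(title="Untitled - Notepad")
    win = app.window(title="Untitled - Notepad")

    shot = win.capture_as_image()   # PIL image of the window
    draw = ImageDraw.Draw(shot)
    origin = win.rectangle()        # window position in screen coordinates

    for ctrl in win.descendants():
        r = ctrl.rectangle()
        # Convert screen coordinates to window-local pixel coordinates.
        box = (r.left - origin.left, r.top - origin.top,
               r.right - origin.left, r.bottom - origin.top)
        if box[2] > box[0] and box[3] > box[1]:
            draw.rectangle(box, outline="red")

    shot.save("annotated.png")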
I think this is to make the human/user experience better. If it relied on accessibility features, the user would need to know how to use those features. Similar to another comment in here, the UX they're shooting for is "click the red button with Cancel on it", and shipping that ASAP.