I wonder if the model has difficulties for the same reason some people do - UI affordability has gone down with the flattening, hover-to-see scrollbar, hamburger-menu-ization of UIs.
I'd like to see a model trained on a Windows 95/NT style UI - would it have an easier time with each UI element having clearly defined edges, clearly defined click and dragability, unified design language, etc.?
What the UI looks like has no effect on for example, Windows UI Automation libraries. How the tech works is that it queries the process directly for the sematic description of items, like here's a button called 'Delete', here's a list of items for TODO's, and you get the tree structure directly from the API.
I wouldn't be surprised if they are working off of screenshots, they still trained their models on having said screenshots annotated by said automation libraries, which told the AI what pixel is what.