In what world would a vision agent be the default? Whatever HTTP-based mechanism a site uses to communicate with its server can usually be reverse-engineered and easily emulated with widely available HTTP request libraries, HTML parsers, and JavaScript engines. At worst, you can use something like Puppeteer to navigate and control the application at a significantly higher level than scraping images and simulating user input.

It seems like you'd need a deliberately hostile app before a vision agent would even be considered as an option.
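
To make the HTTP-level approach concrete, here is a rough sketch using only the Python standard library: parse a page's form, keep its hidden fields (a CSRF token, say), and build the same POST the browser would send. The names `FormExtractor` and `build_form_request`, the URL, and the field names are all made up for illustration, and it assumes a plain POST form with no JavaScript involved.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class FormExtractor(HTMLParser):
    """Collects the first form's action and its named <input> values."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Hidden fields carry their server-supplied value; others start empty.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_request(page_url, html, overrides):
    """Build (but do not send) the POST a browser would submit for the form.

    Assumption: the form uses method=post and URL-encoded fields.
    """
    parser = FormExtractor()
    parser.feed(html)
    fields = {**parser.fields, **overrides}
    target = urljoin(page_url, parser.action)
    return Request(target, data=urlencode(fields).encode("utf-8"), method="POST")


if __name__ == "__main__":
    page = (
        '<form action="/session" method="post">'
        '<input type="hidden" name="csrf" value="abc123">'
        '<input name="user"><input type="password" name="pw">'
        "</form>"
    )
    req = build_form_request(
        "https://example.test/login", page, {"user": "alice", "pw": "s3cret"}
    )
    print(req.get_method(), req.full_url, req.data)
```

Passing the resulting `Request` to `urllib.request.urlopen` would actually submit it. A real site would likely also need cookie handling and possibly extra headers, which this sketch leaves out.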