Computer use is a great idea. It gets the job done when nothing else will.
If you're a person trying to get their job done at a big company, but half your job is in 1-2 proprietary tools or is stuck behind an API you can't program against, computer use can allow you, a non-techie, to do your job more efficiently.
I think it's an awesome way to circumvent gatekeepers and the IT department to let people accomplish their goals.
That is an incredibly niche use case and comes with a boatload of footguns.
Even then, an AI writing AHK scripts likely outperforms it.
It does. I used to be an AHK "script kiddie" and know it front and back. It's sort of burnt into my brain. As a result, I can prompt really well, notice issues at a glance, and I have a sheer volume of scripts locally for all sorts of tasks, some from as far back as 2014: from tiling window managers to OCR, all the way to simple hotkeys/hotstrings. I let it grep in that folder and build out whatever I want using those primitives. This gives me 1-shot, immediately usable, working scripts even with GPT-3.5-level models, as opposed to the iterations needed for typical development.
Example: adding copyright text box to bottom of every slide
Computer use is very useful for developing GUI applications, since Claude Code can build and test the entire app end-to-end (accessibility APIs exist, but depending on the UI framework you choose, you can run into walls very fast).
I run it in a VM using a headless Wayland compositor; I'd never trust even fable with access to my real system.
I think there's a sweet spot: a lot of the time you're probably better off with "reverse engineer this web page and build me an API or personalized Chrome extension to meet my needs".
I have an agent doing price checks for me for an item on a certain website. Instead of blasting through a zillion tokens processing the DOM over and over, it loaded the page once and figured out how to download a JSON with the price.
Does it have to view the page now repeatedly to download the JSON?
It curls the page. I think the approach it took actually wouldn't work in my local browser: it's getting the value from some conversion reporting code that I'm guessing my uBlock extension would hide.
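For anyone wanting to try the "fetch once, parse the embedded JSON" idea, here is a rough Python sketch. It is not the tool the agent built. The thread doesn't say what the page's markup looks like, so this assumes (purely for illustration) that the price sits in a JSON-LD `<script>` block with an `offers.price` field; `extract_price` and `fetch_price` are hypothetical names.

```python
import json
import re
import urllib.request
from typing import Optional

# Assumption: the price is embedded in a JSON-LD <script> block.
# The real page in the thread used some conversion reporting code instead.
_JSON_LD = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_price(html: str) -> Optional[float]:
    """Return the first offer price found in a JSON-LD block, else None."""
    for match in _JSON_LD.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        if not isinstance(data, dict):
            continue
        offers = data.get("offers")
        if isinstance(offers, list):
            offers = offers[0] if offers else None
        if isinstance(offers, dict) and "price" in offers:
            try:
                return float(offers["price"])
            except (TypeError, ValueError):
                continue  # price present but not numeric
    return None


def fetch_price(url: str) -> Optional[float]:
    """Download the page once (like curl) and parse the price out of it."""
    with urllib.request.urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_price(html)
```

The point is the same as in the comment above: no browser and no repeated DOM processing by the model, just one HTTP request and a small parser per check.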
The tool it built will do viewing, probably.
That tool is called Chrome with extra flags in the CLI.
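A minimal sketch of what "Chrome with extra flags" could look like when driven from Python. `--headless`, `--dump-dom` and `--user-data-dir` are real Chrome flags; the binary name `google-chrome` is an assumption (it differs per platform), and the helper names are hypothetical. Whether a reused profile gets you past a given login depends on the site.

```python
import subprocess
from typing import List, Optional


def chrome_dump_dom_cmd(
    url: str,
    binary: str = "google-chrome",  # assumption: adjust for your platform
    profile_dir: Optional[str] = None,
) -> List[str]:
    """Build the command line for a headless Chrome run that prints the DOM."""
    cmd = [binary, "--headless", "--dump-dom"]
    if profile_dir:
        # Reuse an existing profile directory (cookies, sessions).
        cmd.append(f"--user-data-dir={profile_dir}")
    cmd.append(url)
    return cmd


def dump_dom(url: str, **kwargs) -> str:
    """Run headless Chrome and return the rendered DOM as text."""
    result = subprocess.run(
        chrome_dump_dom_cmd(url, **kwargs),
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return result.stdout
```

Unlike a plain curl, this runs the page's JavaScript before dumping the markup, which is the "viewing" part.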
How are folks using “computer use” to click things on intranet portals that are behind an SSO? Even the OP's example just shows visiting a URL and entering a search term… that is sort of useless.
How can I automate things behind an SSO wall? Even if it means I manually authorize it once and watch it do things on its own.
I've never used Gemini computer use, but I assume it's the same:
Claude computer use takes control of all your computer's inputs (mouse and keyboard) plus screenshots. You just log in, tell Claude you're logged in, and let it get to work. It'll use the browser you're logged in with.
The Chrome extension is a little better because it only takes control of its own Chrome tabs (again: you just log in).
Take manual control once, save the login info to a password manager, teach the model to log in with it.
Yeah, it's not that computer use is the most theoretically optimal paradigm, but there's a reasonable case that, given the constraints of modern software systems and how they're built, it's the most realistically optimal paradigm.