Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.

I guess if you're trying to get people to tokenmaxx it may look like a valid strategy, but ain't no way this will be delightful to users.

I think it's a symptom of just not understanding how LLMs should interface with the OS because we're still in their early days.

Eventually there'll be an iPhone moment for the ergonomics of LLM usage outside of coding.

Computer use is a great idea. It gets the job done when nothing else will.

If you're a person trying to get their job done at a big company, but half your job is in 1-2 proprietary tools or is stuck behind an API you can't program against, computer use can allow you, a non-techie, to do your job more efficiently.

I think it's an awesome way to circumvent gate keepers and the IT department to let people accomplish their goals.

That is an incredibly niche use case and comes with a boatload of footguns.

Even then, an AI writing AHK scripts likely outperforms.

It does. I used to be an AHK "script kiddie" and know it front and back. It's sort of burned into my brain. As a result, I can prompt really, really well, notice issues at a glance, and I have a large volume of scripts locally for all sorts of tasks, some from as far back as 2014: everything from tiling window managers to OCR all the way down to simple hotkeys/hotstrings. I let it grep in that folder and build out whatever I want using those primitives. This actually gives one-shot, immediately usable, 100% working scripts even with GPT-3.5-level models, as opposed to the iterations needed for typical development.

Example: adding copyright text box to bottom of every slide

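  ; AutoHotkey v1: F3 adds a copyright text box to every slide of the open deck
  F3::
    pres := ComObjActive("PowerPoint.Application").ActivePresentation
    Loop % pres.Slides.Count {
      slide := pres.Slides.Item(A_Index)
      ; AddTextbox(orientation, left, top, width, height); 1 = horizontal text
      box := slide.Shapes.AddTextbox(1, 100, 500, 500, 30)
      box.TextFrame.TextRange.Text := "Copyright 2026. All Rights Reserved."
    }
  return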

Computer use is very useful for developing GUI applications, since Claude Code can build and test the entire app end-to-end (accessibility APIs exist, but depending on the UI framework of your choosing you can run into walls very fast).

I run it in a VM using a headless Wayland compositor; I'd never trust even fable with access to my real system.

I think there's a sweet spot: a lot of the time you're probably better off with "reverse engineer this web page and build me an API or personalized Chrome extension to meet my needs".

I have an agent doing price checks for me for an item on a certain website. Instead of blasting through a zillion tokens processing the DOM over and over, it loaded the page once and figured out how to download a JSON with the price.

Does it have to view the page now repeatedly to download the JSON?

It curls the page. I think the approach it took actually wouldn't work in my local browser: it's getting the value from some conversion reporting code that I'm guessing my uBlock extension would hide.
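For anyone curious what "fetch once, then read the embedded JSON" can look like, here is a minimal sketch. The sample HTML, the `conversion-data` id and the `price` field are all made up for illustration, since the real site and its markup aren't named here; in practice the HTML would come from a plain HTTP fetch rather than a string.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: the real site's markup and field names are unknown.
SAMPLE_HTML = """
<html><body>
<h1>Widget</h1>
<script type="application/json" id="conversion-data">
{"item": "widget", "price": 19.99, "currency": "USD"}
</script>
</body></html>
"""


class JsonScriptFinder(HTMLParser):
    """Collects the text of every <script type="application/json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_script:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_script = False


def extract_price(html):
    """Return the first embedded 'price' value, or None if there is none."""
    finder = JsonScriptFinder()
    finder.feed(html)
    for block in finder.blocks:
        try:
            payload = json.loads(block)
        except json.JSONDecodeError:
            continue  # not valid JSON, skip it
        if isinstance(payload, dict) and "price" in payload:
            return payload["price"]
    return None


print(extract_price(SAMPLE_HTML))
```

No model call is needed on repeat runs: once the extraction is written, each price check is one fetch plus a parse.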

The tool it built will do viewing, probably.

That tool is called Chrome with extra flags in the CLI.

How are folks using “computer use” to click things on intranet portals that are behind an SSO? Even the OP's example just shows "visit a URL and enter this search term"… that is sort of useless.

How can I automate things behind an SSO wall? Even if it means I manually authorize it once and watch it do things on its own...

I've never used Gemini computer use, but I assume it's the same:

Claude computer use takes control of all of your computer's inputs (mouse and keyboard) plus screenshots. You just log in, tell Claude you're logged in, and let it get to work. It'll use the browser you're logged in with.

The Chrome extension is a little better because it only takes control of its own Chrome tabs (again: you just log in).

Take manual control once, save the login info to a password manager, teach the model to log in with it.

Yeah, it's not that computer use is the most theoretically optimal paradigm, but there's a reasonable case that, given the constraints of modern software systems and how they're built, it's the most realistically optimal paradigm.

The “correct”, elegant way for AI to interact with existing software would take decades and billions of dollars to build. Someone would have to do the hard work of building new APIs, solving decades of accessibility issues, etc.

Or you can show an AI screenshots and ask it where to click.
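To make the "screenshot, ask where to click, repeat" loop concrete, here is a toy sketch. `FakeScreen` and `fake_model` are hypothetical stand-ins invented for this example (no real screen capture or model API is involved); the point is only the shape of the loop.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # "click" or "done"
    x: int = 0
    y: int = 0


class FakeScreen:
    """Stand-in for a real desktop: one button at a fixed position."""

    def __init__(self):
        self.button_at = (40, 20)
        self.clicked = False

    def screenshot(self):
        # A real system would return pixels; text keeps the sketch runnable.
        return "dialog closed" if self.clicked else f"button at {self.button_at}"

    def click(self, x, y):
        if (x, y) == self.button_at:
            self.clicked = True


def fake_model(observation):
    """Stand-in for the model call: maps an observation to the next action."""
    if observation.startswith("button at"):
        return Action("click", 40, 20)
    return Action("done")


def run(screen, model, max_steps=5):
    """Screenshot -> ask the model -> act, until it says done or steps run out."""
    for step in range(1, max_steps + 1):
        action = model(screen.screenshot())
        if action.kind == "done":
            return step
        screen.click(action.x, action.y)
    return None  # gave up


print(run(FakeScreen(), fake_model))
```

Every pass through the loop is one model call on a fresh screenshot, which is where the "slow and expensive" complaints upthread come from.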

I disagree if your application is networked. Most SaaS is built on RESTful APIs that can be converted trivially into interfaces / contracts for tool use.
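A rough sketch of what "converted into interfaces / contracts for tool use" might look like. The endpoint description is invented and only loosely modeled on an OpenAPI operation, and the output keys (`name`, `description`, `input_schema`) are an assumption, since the exact tool format differs between providers.

```python
import json

# Hypothetical endpoint description, loosely modeled on an OpenAPI operation.
OPERATION = {
    "operationId": "getInvoice",
    "method": "GET",
    "path": "/invoices/{invoice_id}",
    "summary": "Fetch a single invoice",
    "parameters": [
        {"name": "invoice_id", "type": "string", "required": True},
        {"name": "include_lines", "type": "boolean", "required": False},
    ],
}


def to_tool(operation):
    """Turn one endpoint description into a JSON-Schema-style tool definition."""
    params = operation["parameters"]
    return {
        "name": operation["operationId"],
        "description": f'{operation["summary"]} ({operation["method"]} {operation["path"]})',
        "input_schema": {
            "type": "object",
            "properties": {p["name"]: {"type": p["type"]} for p in params},
            "required": [p["name"] for p in params if p["required"]],
        },
    }


print(json.dumps(to_tool(OPERATION), indent=2))
```

The mapping really is mechanical when a machine-readable API description exists; the hard cases are apps that have no such description, or no usable API at all.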

So you can either wait for every application to do that, or at least make it possible for an LLM to do it… or you can make the LLM use a computer interface that works with every application by definition.

The middle ground would be leveraging e.g. standard a11y APIs, and/or hooking into applications like Squish does.

Then you get a nice textual world that fits the LLM without having to rewrite every application to have a full-blown HTTP server.
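As an illustration of that "textual world", here is a small sketch that flattens an accessibility tree into numbered lines a model could read and refer back to. The tree is a made-up dict; real a11y APIs expose much richer nodes (states, bounds, actions), so treat the structure as an assumption.

```python
# Hypothetical accessibility tree; real a11y APIs expose richer nodes.
TREE = {
    "role": "window", "name": "Settings", "children": [
        {"role": "checkbox", "name": "Dark mode", "children": []},
        {"role": "button", "name": "Save", "children": []},
    ],
}


def flatten(node, depth=0, lines=None):
    """Render the tree as indented 'role: name' lines, numbered for reference."""
    if lines is None:
        lines = []
    lines.append(f'{"  " * depth}[{len(lines)}] {node["role"]}: {node["name"]}')
    for child in node["children"]:
        flatten(child, depth + 1, lines)
    return lines


print("\n".join(flatten(TREE)))
```

With something like this the model can answer "activate element 2" instead of guessing pixel coordinates from a screenshot.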

It takes decades and billions of dollars to develop APIs?

The iPhone moment is an AI that can completely manage your personal life. It has full access to every financial account you own and handles all admin work. It could sign you up for a new account, pay, and give you the login.

If you can SAFELY do that, it's a big moment. But to be clear, safety is a massive problem. Until you see a big company start saying the AI can use your SSN, CC, and bank password safely, we aren't there yet.

Cars were around for decades before they came up with seatbelts. Claude Cowork will happily go through your files, which might just have your SSN in them, and ignore previous instructions.

But we have regulation and compliance for consumer secrets? That's not a comparable example.

The difference is that if OpenAI gave you a product and it leaked a million people's bank passwords, it would destroy the entire company.

Again, until a big tech product can bring that to a clean user experience, we're not there yet. Even the most zealous OpenClaw users are not hooking their bank accounts into the AI yet. I'm sure they exist, but I've not seen them.

Also every big tech computer use product actively screams for you not to give their agents secrets.

Every major company screams not to put secrets in their computer use bot.

Seatbelts were regulated later. Your SSN and CC have been regulated for over a decade.

Tens of millions of users every day rely on Robotic Process Automation. It’s the glue that holds companies together.

Spreadsheet is such a terrible idea. It may look like a valid tool, but ain't no way it's delightful to users. Most of the time people need a database instead. Eventually there'll be an iPhone moment for this.

Meanwhile, the entire world economy:

I mean, your words not mine. You can't just claim I'm making a point I didn't.

Spreadsheets are fucking glorious, powerful, clever, amazing and delightful, in my view.

> Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.

And yet having an agent able to use a computer on your behalf is really useful.

Recently I gave a NixOS VM to my Hermes agent and it has been a good experience. I don't really care if he destroys the machine, since I can just roll back to an earlier version, and for any meaningful data he creates for me I make sure he creates a repo, commits, and pushes to my private Gitea instance.

> And yet having an agent able to use a computer on your behalf is really useful.

It is, but there's no need for it to be viewing your screen, browsing websites and watching ads.

That stuff is for humans, not for LLMs.

Sure, I don't want an agent watching MY screen. That's why I gave him his own environment, and pretty quickly he discovered that you can open Chrome and make it render to a framebuffer; this way he is able to 'view' the website. And apparently with this he is able to bypass a lot of 'anti-bot' measures.

> And yet having an agent able to use a computer on your behalf is really useful.

I honestly cannot think of a single use case

I think the main advantage is adaptability.

Imagine you have a pretty exotic task you need to complete that involves converting a video file from one format to another.

You can use ChatGPT or something similar, and the best you will get is either a script you can run on your machine that does what you need, or he may decide to render a new video.

If you have something like Open WebUI you could configure an MCP server that converts videos and allow the model to use it to do your task. This should work, but it is quite a lot of work for something you'll only ever do once.

But if the agent has its own environment he can decide to install ffmpeg, execute the transformation and serve you the file you want.

In reality there are no new capabilities with this approach, but things get a lot more comfortable.

This doesn't require computer use, just a bash tool (and possibly fetch to get ffmpeg documentation)
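For reference, a "bash tool" in this sense can be very small. Below is a minimal sketch, assuming a harness that hands the model's chosen command to something like `run_tool` and feeds the result back; both function names are made up for this example. The ffmpeg invocation is the plain `-i input output` form, with `-y` to overwrite an existing output file.

```python
import subprocess
import sys


def ffmpeg_command(src, dst):
    """Build a basic conversion; ffmpeg infers formats from file extensions."""
    return ["ffmpeg", "-y", "-i", src, dst]


def run_tool(argv, timeout=60):
    """Run a command without a shell and return what the model would see."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError:
        return {"exit_code": None, "output": f"{argv[0]}: not installed"}
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "output": "timed out"}
    return {"exit_code": done.returncode, "output": done.stdout + done.stderr}


# Demo with the Python interpreter itself, which is guaranteed to exist here.
print(run_tool([sys.executable, "-c", "print('hello')"]))
print(ffmpeg_command("clip.mov", "clip.mp4"))
```

No screenshots or mouse control are involved: the model only ever sees an exit code and text output, which is the whole argument for a shell tool over computer use in this case.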

Yeah even Claude Cowork would do this, doesn't need "computer use"

Literally everything you do every day.

It's the end game of AI. Have systems trained on doing EVERYTHING you do on a computer all day. Trained by you while doing the job.

I'll give you one: Google News is pretty terrible right now. Almost all the interesting news sources are paywalled, so I get recommended all kinds of weird lifestyle publications that are really horrible. With the computer use API I can just tell Gemini to look at Google News, pick the articles that look interesting, look them up on archive.is, and give me the plain-text article along with a summary. I think that would probably work pretty well.

Have you ever done something tedious on a computer?

We shouldn’t optimize for token use. We should build infrastructure to make tokens dirt cheap instead.

It's great for testing and QA automation for UIs. It's also possibly good for the vision impaired.

UI QA only works well if your model plausibly matches the average user behavior and/or real-world edge cases. These models are far from that, and they are much less random than you'd like them to be for fuzzing (mode collapse).

It doesn't need to be that kind of QA. Even just a basic "I want the AI to build the beginnings of a GUI app for me" will work much better if the AI can see the output of its work and iterate on it. Similarly, if you want the AI to fix a GUI bug, it's much better if you can show it the bug and tell it how to test to see when it's gone.

The LLM does not require computer use to see the GUI and, again, that's a pretty niche use and not what Computer Use is being marketed for.

> not what Computer Use is being marketed for

Okay, fair, I haven't really paid attention to marketing.

> the LLM does not require computer use to see the GUI and

It can take screenshots without computer use, but it can't click around. I didn't have access to computer use until recently (I'm on an OS where Claude Code technically shouldn't run, I had to patch the binary), and when I got it working it made a big difference because of this.