my quick notes on Computer Use:

- "computer use" is basically Claude's vision + tool use capabilities run in a loop. There's a reference implementation, but no "Claude desktop" app that ships with this OOTB
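The loop as I understand it, sketched in Python. This is my reading of the reference implementation, not Anthropic's actual code: the screenshot/tool/model helpers are stubs standing in for real OS integration and a real API call.

```python
# Minimal sketch of the computer-use agent loop. All three helpers are
# hypothetical stubs; the real loop uses an Anthropic API client, a
# screen-capture library, and OS-level mouse/keyboard control.

def take_screenshot():
    # Stub: the real loop captures the screen as a base64-encoded PNG.
    return "<base64 png>"

def call_model(messages):
    # Stub: the real loop calls the model with computer-use tools
    # enabled. Here we terminate immediately for illustration.
    return {"stop_reason": "end_turn", "content": "done"}

def execute_tool(tool_use):
    # Stub: would dispatch the requested click/type action via the OS.
    return "ok"

def agent_loop(task, max_iters=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_iters):
        response = call_model(messages)
        if response["stop_reason"] != "tool_use":
            return response["content"]  # model decided it's finished
        # Model requested an action: run it, screenshot the result, and
        # feed both back so the next turn sees the new screen state.
        result = execute_tool(response["content"])
        messages.append({"role": "assistant", "content": response["content"]})
        messages.append({"role": "user",
                         "content": [result, take_screenshot()]})
    return "max iterations reached"
```

The key point is that vision is only invoked once per tool execution, which is why the loop is I/O-bound rather than token-bound.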

- they're basically advertising that they bumped up Claude 3.5's screen vision capability. we discussed the importance of this general computer agent approach with David on our pod https://x.com/swyx/status/1771255525818397122

- @minimaxir raises questions about cost. Note that vision use is actually quite sparing: the loop is I/O-constrained, waiting for the tool to run before taking a screenshot and looping again. For a simple 10-loop task at max resolution, Haiku costs <1 cent, Sonnet ~8 cents, Opus ~41 cents.
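A back-of-envelope version of that cost math. Assumptions (all hedged): the vision docs' rule of thumb of roughly (width × height) / 750 tokens per image, a 1280x800 screenshot, ~200 output tokens per loop, and per-million-token prices as I recall them at announcement time. Verify against the current pricing page before relying on any of this.

```python
# Rough cost estimate for a 10-screenshot computer-use loop.
# Token formula and prices are assumptions; check current docs/pricing.

def image_tokens(width, height):
    # Vision docs' approximation: ~(w * h) / 750 tokens per image.
    return width * height / 750

def loop_cost(input_price_per_mtok, output_price_per_mtok,
              loops=10, text_out_per_loop=200):
    img = image_tokens(1280, 800)  # ~1365 tokens per screenshot
    input_tokens = loops * img
    output_tokens = loops * text_out_per_loop
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1e6

# Assumed launch-era prices in USD per million tokens (verify!):
haiku = loop_cost(0.25, 1.25)   # well under a cent
sonnet = loop_cost(3, 15)       # several cents
opus = loop_cost(15, 75)        # tens of cents
```

The outputs land in the same ballpark as the figures quoted above (the quoted numbers use max resolution, so they run a bit higher).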

- beating o1-preview on SWE-bench Verified without extended reasoning, and at 4x cheaper output per token (far cheaper in total tokens, since there are no reasoning tokens), is ABSOLUTE mogging

- New 3.5 Haiku is 68% cheaper than Claude Instant haha

references I had to dig a bit to find:

- https://www.anthropic.com/pricing#anthropic-api

- https://docs.anthropic.com/en/docs/build-with-claude/vision#...

- loop code https://github.com/anthropics/anthropic-quickstarts/blob/mai...

- some other screenshots https://x.com/swyx/status/1848751964588585319

- https://x.com/alexalbert__/status/1848743106063306826

- model card https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Cla...

Haven't used vision models before; can someone comment on whether they are good at "pointing at things"? E.g. given a picture, give the coordinates of the text "foo".

This is the key to accurate control; it needs to be very precise.

Maybe Claude's model is trained for this. Also, what about open-source vision models? Are any good at "pointing at things" on a typical computer screen?

I mean, like with everything, they'll kinda be able to do it, and will only get really good at it if the model trainers prioritized it. See Pixmo for a recent example.
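For what it's worth, the usual approach to "pointing" is just to prompt the vision model for coordinates and parse a structured reply. A hypothetical sketch (the model call is stubbed, and the prompt wording is my own invention; real accuracy depends entirely on whether the model was trained for coordinate grounding):

```python
# Hypothetical "pointing" flow: ask a vision model for pixel coordinates
# of a UI element and parse a JSON reply. ask_model is a stub standing
# in for a real vision-model API call.
import json

POINT_PROMPT = (
    'Return ONLY JSON like {"x": <int>, "y": <int>} giving the pixel '
    'coordinates of the center of the text "foo" in this screenshot.'
)

def ask_model(prompt, image):
    # Stub: a real call would send the prompt plus the image bytes.
    return '{"x": 412, "y": 233}'

def point_at(image, target="foo"):
    reply = ask_model(POINT_PROMPT, image)
    coords = json.loads(reply)
    return coords["x"], coords["y"]
```

Even with a model that grounds well, you typically still need to clamp/validate the reply against the actual screen dimensions before clicking.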

See https://github.com/OpenAdaptAI/OpenAdapt for an open source implementation that includes a desktop app OOTB.