Hacker News

I haven't used vision models before. Can someone comment on whether they are good at "pointing at things"? E.g., given a picture, return the coordinates of the text "foo".

This is the key to accurate control, so it needs to be very precise.
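
To make the task concrete, here is a rough sketch of the interface I have in mind. It assumes some detector returns labeled bounding boxes in pixels; the `Box` type, `point_at` function, and sample data are all made up for illustration and are not any particular model's API. The click point is simply the center of the matching box.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    # Hypothetical detector output: a text label plus a pixel bounding box.
    text: str
    left: int
    top: int
    width: int
    height: int


def point_at(boxes: list[Box], target: str) -> Optional[tuple[float, float]]:
    """Return the center (x, y) of the first box whose text equals target."""
    for box in boxes:
        if box.text == target:
            return (box.left + box.width / 2, box.top + box.height / 2)
    return None  # target text was not found on screen


# Made-up sample data standing in for what a model might return.
boxes = [Box("bar", 10, 10, 40, 12), Box("foo", 100, 200, 30, 10)]
print(point_at(boxes, "foo"))  # (115.0, 205.0)
```

With a target only 10 pixels tall like the one above, an error of a few pixels in the returned box is enough to miss the click entirely.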

Maybe Claude's model is trained for this. Also, what about open source vision models? Are any of them good at "pointing at things" on a typical computer screen?