I haven't used vision models before. Can someone comment on whether they are good at "pointing at things"? E.g., given a picture, return the coordinates of the text "foo".

This is the key to accurate control; it needs to be very precise.
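
To make "precise" concrete, here is a rough sketch of what I'd want to check. It assumes the model returns a point as normalized (x, y) in [0, 1], which is only an assumption since output formats differ between models, and it counts a point as correct if it lands inside the target's bounding box. `Box`, `to_pixels`, and `hits` are hypothetical names for illustration, not any library's API.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Pixel bounding box of the target element (right/bottom exclusive)."""
    left: int
    top: int
    right: int
    bottom: int


def to_pixels(nx: float, ny: float, width: int, height: int) -> tuple[int, int]:
    """Scale assumed-normalized [0, 1] coordinates to pixel coordinates."""
    # Clamp so that nx == 1.0 or ny == 1.0 still maps to a valid pixel.
    x = min(int(nx * width), width - 1)
    y = min(int(ny * height), height - 1)
    return x, y


def hits(point: tuple[int, int], box: Box) -> bool:
    """True if the predicted point falls inside the target box."""
    x, y = point
    return box.left <= x < box.right and box.top <= y < box.bottom


# Hypothetical example: the text "foo" occupies this box on a 1920x1080 screenshot.
foo_box = Box(left=940, top=260, right=1000, bottom=284)
predicted = to_pixels(0.5, 0.25, 1920, 1080)
print(predicted, hits(predicted, foo_box))
```

With small targets like a 24-pixel-tall line of text, an error of a couple of percent of screen height is already enough to miss, which is why the precision matters so much.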

Maybe Claude's model is trained for this. What about open-source vision models? Are any of them good at "pointing at things" on a typical computer screen?

I mean, like with everything, they'll kind of be able to do it, but they only get really good at it if the model trainers prioritized it. See Pixmo for a recent example.