Hacker News

Click coordinates. Agentic GUI work is really annoying when the multimodal agent cannot click on x,y coordinates.

I tested Qwen3.6, Gemma4, and Nemotron3-nano-omni. They fully hallucinate x,y coords. (I did not try GLM-5V yet.)

GPT-5.5 can easily do it. But Vocaela, a tiny 500M model, is also quite good at it. I hope they improve the training for x,y clicking on the smallish multimodal models soon.

Recently slopped an HTTP service together just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser, etc.): https://github.com/julius/vocaela-click-coords-http

lopuhin 17 hours ago

Qwen3.5 is able to output click coordinates and bounding boxes just fine, as values normalized to 0..1000. I’d hope Qwen3.6 didn’t lose this capability.
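For anyone wiring this up, here is a minimal sketch of mapping such normalized values back to screen pixels. It assumes the model's coordinates are on a 0..1000 grid relative to the full screenshot and that boxes come as (x1, y1, x2, y2). The function names are hypothetical, not part of any Qwen tooling.

```python
def normalized_to_pixels(x_norm, y_norm, width, height, scale=1000):
    """Map a point on a 0..scale grid to pixel coordinates.

    Assumption: (0, 0) is the top-left corner of the screenshot and
    (scale, scale) is the bottom-right corner.
    """
    x = int(x_norm * width / scale)
    y = int(y_norm * height / scale)
    # Clamp so a value of exactly `scale` still lands inside the image.
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return x, y


def bbox_center_pixels(bbox_norm, width, height, scale=1000):
    """Return the pixel center of a normalized (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox_norm
    return normalized_to_pixels((x1 + x2) / 2, (y1 + y2) / 2, width, height, scale)


if __name__ == "__main__":
    # A click at the middle of the grid on a 1920x1080 screenshot.
    print(normalized_to_pixels(500, 500, 1920, 1080))
```

If the screenshot was resized before being sent to the model, the width and height passed in should be those of the real screen, since the normalized grid is resolution independent.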

withinrafael 17 hours ago

I've had lots of success with generating coordinates and answering questions using the UI-TARS model https://github.com/bytedance/UI-TARS.

theturtletalks 15 hours ago

I’d also check out midscene. You can set the model there: UI-TARS works, and Qwen vision models work too.

cyanydeez 19 hours ago

This sounds a lot like another Hacker News post from the last few days. It's the same problem image generators have with a prompt like "produce numbers 1-50 in a spiral pattern": they can't count properly. But if you break it into raster and vector steps, where you have it first produce the visual content and then an SVG overlay, it's completely capable.

Have you tried a two-step approach: review the image, then render a vector?

julius 19 hours ago

Maybe there is a smart trick to get them to do the right thing, but the things I tried did not work.

At one point I had a smaller model draw bounding boxes around everything that looked interactable, with labels like "e3", and then asked the main model to answer with something like "click on e3". It did not work in my tests; it was pretty much as bad as x,y.
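A rough sketch of the label-based variant described above, for reference. It assumes a detector has already produced a dict of labels to pixel boxes (left, top, right, bottom), and that the model replies with free text containing a label such as "e3". The function name and data layout are hypothetical and not taken from any existing tool.

```python
import re


def resolve_label_click(reply, boxes):
    """Turn a reply like 'click on e3' into the pixel center of box e3.

    `boxes` maps labels ("e1", "e2", ...) to (left, top, right, bottom)
    pixel rectangles. Returns None when the reply names no known label.
    """
    match = re.search(r"\be\d+\b", reply)
    if match is None:
        return None
    box = boxes.get(match.group(0))
    if box is None:
        # The model named a label that was never drawn on the screenshot.
        return None
    left, top, right, bottom = box
    return (left + right) // 2, (top + bottom) // 2


if __name__ == "__main__":
    boxes = {"e1": (10, 10, 90, 30), "e3": (100, 40, 180, 80)}
    print(resolve_label_click("click on e3", boxes))
```

This only removes the need for the model to produce raw numbers. It does not help if the model picks the wrong label, which matches the failure described above.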

cyanydeez 18 hours ago

Yeah, I've held off on doing any kind of RAG until there are models that properly handle layout detection and partitioning, because it's so easy to generate shitty data if you don't attend to the visual cues before you slice up a document.