it would be a really great option if it didn't lack vision

this is mcp or custom call to lowest cost model

someone did a webcam + agentic + capture of other computer bios/boot -> upload to image model -> back to agent

what do you use vision for? I have failed to find a workflow with it that makes sense, asking it to review screenshots of websites or whatever it misses extremely obvious details like text flowing out of it's container/overlapping other text, things being in entirely the wrong place, etc.

What models have you tried? Gemini 3.1 pro has vision capable of reading my sloppy diaries from 10 years ago, down to small glyphs and doodles.

I mean they mostly work for OCR, I meant in a coding context.

For coding?