Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"
Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"
That doesn’t work well in a lot of scenarios. The text LLM doesn’t know what to look for in an image before it sees a description, you might need multiple rounds of back and forth.
Are you suggesting it should summarize the image in text or generate it in HTML or something else?