I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.

That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.

In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.

Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.

Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"

Are you suggesting it should summarize the image in text or generate it in HTML or something else?

I've been using Google ai studio as a free vision bridge. Gemma 31B is dummy capable at vision and at 1500 rpd its basically unlimited.

I don't see this being such a big gap. There are some use-cases for sure but apart from UX/UI work it is not really needed. Besides, none of the frontier models can replicate actual images - the can approximate at least in my own experience.

One of my tests for a new model is dumping in a screenshot of a web page and seeing if it can recreate it from scratch in HTML and CSS.

Even the local models I run on my Mac are getting surprisingly good at that now.

Using llms to generate docx. Being able to rasterize and review is an important part of the process.

[deleted]

I had the same reaction with Deepseek V4 ! It would be more useful as a vision model