Agreed. Even simply I'm sure a service like this already exists (or could easily exist) where the workflow is something like:
1. User provides information
2. LLM generates structured output for whatever modeling language
3. Same or other multimodal LLM reviews the generated graph for styling / positioning issues and ensure its matches user request.
4. LLM generates structured output based on the feedback.
5. etc...
But you could probably fine-tune a multimodal model to do it in one shot, or way more effectively.