You might combine a general world model with a python coding model in that case. Not sure if it's better, just saying.

What's the difference between a "general world model combined with a python coding model" and a multimodal LLM?