You might combine a general world model with a python coding model in that case. Not sure if it's better, just saying.
What's the difference between a "general world model combined with a python coding model" and a multimodal LLM?
What's the difference between a "general world model combined with a python coding model" and a multimodal LLM?