That's because you used an LLM trained to produce text, but you asked it to perform actions, not just produce text. An agentic model would be able to do it, precisely by running that Python code. One could argue that a 3-year-old does exactly that (produces a plan, then executes it). But these models have deeper problems with comprehension and logical consistency, which (thankfully) prevent us from completely removing the need for a human in the loop who keeps an eye on things.