This works for code because there is an external verification step: the agent has to run the code on a machine and observe the results. That is easy for software, since LLMs are themselves software and can simply invoke other software. It becomes much harder in many other scientific fields.
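As a rough illustration of that verification step, here is a minimal sketch of how an agent might run a candidate program and observe the outcome. The helper name `run_and_observe` and its timeout parameter are hypothetical, chosen only for this example. It assumes the candidate is Python source that can be executed in a fresh interpreter.

```python
import subprocess
import sys


def run_and_observe(source: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Run candidate Python source in a fresh interpreter (hypothetical helper).

    Returns (succeeded, observation), where the observation is stdout on
    success and stderr on failure.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],  # same interpreter, new process
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"

    succeeded = proc.returncode == 0
    return succeeded, proc.stdout if succeeded else proc.stderr


if __name__ == "__main__":
    # A passing candidate and a failing one; the agent sees both outcomes.
    print(run_and_observe("print(1 + 1)"))
    print(run_and_observe("raise ValueError('boom')"))
```

The point of the sketch is that the feedback comes from the process exit code and its output, not from the model's own judgment of the code.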