And this is why it is the height of irresponsibility to run LLMs on your system. We know they are unreliable and just make things up; it's extremely foolish to go "yeah I'm going to let that run commands".
And this is why it is the height of irresponsibility to run LLMs on your system. We know they are unreliable and just make things up; it's extremely foolish to go "yeah I'm going to let that run commands".
It's not _really_ any different to running an undocumented third party binary. Is it the height of irresponsibility to run Windows, or VSCode, or Spotify?
I think the model we've got now is wrong, and the harnesses should be OS-level sandboxed, and the agents should be running in harness managed sandboxes.