The thing I keep coming back to with local agent sandboxing is that the threat model is actually two separate problems that get conflated.

Problem 1: the agent does something destructive by accident — rm -rf, a hard git reset, writing to the wrong config. Filesystem sandboxing solves this well.

Problem 2: the agent does something destructive because it was prompt-injected via a file it read. Sandboxing doesn't help here — the agent already has your credentials in memory before it reads the malicious file.

The only real answer to problem 2 is either never give the agent credentials that can do real damage, or have a separate process auditing tool calls before they execute. Neither is fully solved yet.
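The auditing idea can be sketched in a few lines. This is a hypothetical illustration, not any real agent framework's API — ToolCall and POLICY are names I made up. The point is just that the gate runs in a separate layer, deny-by-default, before anything executes:

```python
# Hypothetical sketch of a tool-call audit gate that runs before execution.
# ToolCall and POLICY are illustrative names, not a real framework's API.
from dataclasses import dataclass

@dataclass
class ToolCall:
    tool: str
    args: dict

# Deny-by-default: tools not listed here never run unattended.
POLICY = {
    "read_file": "allow",
    "run_shell": "deny",      # anything this destructive needs a human
    "write_file": "confirm",  # pause and ask before executing
}

def audit(call: ToolCall) -> str:
    """Return 'allow', 'deny', or 'confirm' for a proposed tool call."""
    return POLICY.get(call.tool, "deny")

print(audit(ToolCall("read_file", {"path": "README.md"})))  # allow
print(audit(ToolCall("run_shell", {"cmd": "rm -rf /"})))    # deny
```

The key property is that the policy check lives outside the agent process, so a prompt-injected agent can't talk itself past it.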

Agent Safehouse is a clean solution to problem 1. That's genuinely useful and worth having even if problem 2 remains open.

Matchlock[0] is probably the best solution I've come across so far for both problems 1 and 2:

> Matchlock is a CLI tool for running AI agents in ephemeral microVMs - with network allowlisting, secret injection via MITM proxy, and VM-level isolation. Your secrets never enter the VM.

In a nutshell, it addresses problem 2 through a combination of a network allowlist and per-host secret masking/injection. Secrets are never actually exposed inside the sandbox: the agent only ever sees a placeholder string, and the MITM proxy layer replaces the placeholder with the actual secret outside the sandbox before forwarding the request to its destination.

Furthermore, because secrets are scoped per host, you can specify that OPENAI_API_KEY is shared only with api.openai.com, and that is the only host for which the placeholder string will be replaced with the actual secret value.
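The per-host replacement logic is simple to illustrate. This is a sketch of the idea only, not Matchlock's actual implementation — the SECRETS table, placeholder string, and key value here are all made up:

```python
# Sketch of per-host placeholder replacement (not Matchlock's real code).
# The sandbox only ever sees the placeholder; the proxy holds the real key
# and swaps it in only for the host the secret is scoped to.
SECRETS = {
    # host -> (placeholder visible inside the VM, real value held by the proxy)
    "api.openai.com": ("OPENAI_API_KEY_PLACEHOLDER", "sk-real-key"),
}

def rewrite_auth_header(host: str, header_value: str) -> str:
    """Replace the placeholder with the real secret, but only for its host."""
    for allowed_host, (placeholder, real) in SECRETS.items():
        if host == allowed_host and placeholder in header_value:
            return header_value.replace(placeholder, real)
    return header_value  # any other host receives the useless placeholder

print(rewrite_auth_header("api.openai.com", "Bearer OPENAI_API_KEY_PLACEHOLDER"))
print(rewrite_auth_header("evil.example.com", "Bearer OPENAI_API_KEY_PLACEHOLDER"))
```

Even if a prompt-injected agent exfiltrates the placeholder to an attacker-controlled host, the proxy never rewrites it there, so the attacker gets a worthless string.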

edit to actually add the link

[0] https://github.com/jingkaihe/matchlock

problem 2 is actually scarier than most people realize because it compounds. your agent reads a README in some dependency, that README contains injection instructions, and now the agent is acting on behalf of the attacker with whatever permissions you gave it. filesystem sandboxing doesn't help because the dangerous action might be "write a backdoor into the file i already have write access to", which is completely within the sandbox rules.

the short-lived scoped credentials approach someone mentioned upthread is probably the best practical mitigation right now. but even that breaks down when the agent legitimately needs broad access to do its job - if it's refactoring across a monorepo, it basically needs write access to everything.
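a sketch of what the short-lived scoped credential idea looks like - all names here are illustrative, and a real system would mint these via something like an STS/token-exchange service rather than a local dataclass:

```python
# Illustrative sketch of a short-lived scoped credential. In practice a
# token service would mint these; the scope string and 300s TTL are made up.
import time
from dataclasses import dataclass

@dataclass
class ScopedToken:
    scope: str          # e.g. "repo:myorg/myrepo:write"
    expires_at: float   # unix timestamp after which the token is dead

    def permits(self, action: str) -> bool:
        """Valid only while fresh, and only for the exact scope it carries."""
        return time.time() < self.expires_at and action == self.scope

token = ScopedToken(scope="repo:myorg/myrepo:write",
                    expires_at=time.time() + 300)  # 5-minute lifetime
print(token.permits("repo:myorg/myrepo:write"))    # True while fresh
print(token.permits("repo:otherorg/other:write"))  # False: out of scope
```

the expiry limits the blast radius of an injected agent to one narrow scope for a few minutes - which, as the comment above says, stops working once the job itself requires a broad scope.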

i think the actual long-term answer is something closer to capability-based security, where each tool call gets its own token scoped to exactly what that specific action needs. but nobody has built that yet in a way that doesn't make the agent 10x slower.
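what that could look like, as a toy sketch (CapabilityStore and the mint/consume names are hypothetical, not any existing system): the planner mints a single-use capability for exactly one (operation, resource) pair, and the executor refuses anything else.

```python
# Toy sketch of per-tool-call capabilities. Hypothetical names throughout.
# Each capability is single-use and bound to one (operation, resource) pair.
import secrets

class CapabilityStore:
    def __init__(self):
        self._caps = {}  # token -> (operation, resource)

    def mint(self, operation: str, resource: str) -> str:
        token = secrets.token_hex(8)
        self._caps[token] = (operation, resource)
        return token

    def check_and_consume(self, token: str, operation: str, resource: str) -> bool:
        # pop() makes the capability single-use; the comparison makes it
        # valid only for the exact pair it was minted for.
        return self._caps.pop(token, None) == (operation, resource)

store = CapabilityStore()
cap = store.mint("write", "src/app.py")
print(store.check_and_consume(cap, "write", "src/app.py"))   # True
print(store.check_and_consume(cap, "write", "src/app.py"))   # False: consumed
print(store.check_and_consume(cap, "write", "src/other.py")) # False: wrong resource
```

the round-trip to mint a token per call is exactly where the "10x slower" worry comes from.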

Problem 2 is mitigated by allowing only trusted sources through firewall rules.

I think these are two independent axes:

1. Destructive by accident

2. Destructive because it was prompt-injected

And

1. Fucks up the filesystem

2. Fucks up external systems via credentials