The framing that's worked for us: treat it like Unix privilege escalation, not like a promise system.

"Never do X" in a prompt is a promise. Promises break under pressure, long context, or adversarial inputs. Structural constraints don't.

What we actually use:

1. Minimum viable API permissions from the start. The agent's Gmail credentials are IMAP read-only — sending requires SMTP credentials it never receives, so it physically cannot send email. The Stripe key is read-only — it can check payments but can't issue refunds. The agent isn't "trusted not to" do these things. It literally can't.
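A minimal sketch of what this looks like at launch time (the variable names are hypothetical, not our actual config): the agent subprocess gets an allowlisted environment, so the send/write secrets never exist in its address space.

```python
# Hypothetical secret names -- illustrating credential scoping, not a real config.
# Only read-side secrets are allowlisted; SMTP and full Stripe keys never
# enter the agent's environment, so "don't send email" isn't a rule, it's physics.
AGENT_ALLOWED_SECRETS = {
    "IMAP_HOST",
    "IMAP_READONLY_USER",
    "IMAP_READONLY_PASSWORD",
    "STRIPE_READONLY_KEY",     # restricted key: read charges, no refunds
}

def agent_environment(parent_env):
    """Build the agent subprocess env: allowlisted secrets only."""
    return {k: v for k, v in parent_env.items() if k in AGENT_ALLOWED_SECRETS}

# The parent holds everything; the agent receives only the read side.
parent = {
    "IMAP_HOST": "imap.example.com",
    "IMAP_READONLY_USER": "agent",
    "IMAP_READONLY_PASSWORD": "…",
    "SMTP_PASSWORD": "…",          # never reaches the agent
    "STRIPE_SECRET_KEY": "…",      # never reaches the agent
    "STRIPE_READONLY_KEY": "rk_…",
}
env = agent_environment(parent)
assert "SMTP_PASSWORD" not in env
```

The point is that the filter runs in the launcher, outside anything the agent can influence.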

2. Internal vs external action classification. We define this explicitly upfront. Internal actions (write files, read data, run code, push to GitHub) — the agent does freely. External actions (send email, post publicly, spend money) — require an explicit approval step. This dividing line lives in the identity file, not the prompt.
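The classification can be a plain lookup table plus one dispatch function — a sketch with made-up action names, not our real tool list:

```python
from enum import Enum

class Scope(Enum):
    INTERNAL = "internal"   # executes immediately
    EXTERNAL = "external"   # routed to human approval

# Hypothetical registry: every tool the agent can call is classified upfront.
ACTION_SCOPE = {
    "write_file":    Scope.INTERNAL,
    "run_code":      Scope.INTERNAL,
    "push_to_github": Scope.INTERNAL,
    "send_email":    Scope.EXTERNAL,
    "post_publicly": Scope.EXTERNAL,
    "spend_money":   Scope.EXTERNAL,
}

def dispatch(action, execute, request_approval):
    """Run internal actions directly; queue external ones for approval.
    Unknown actions raise KeyError -- unclassified means unrunnable."""
    if ACTION_SCOPE[action] is Scope.EXTERNAL:
        return request_approval(action)
    return execute(action)
```

Failing closed on unknown actions is deliberate: a new tool can't slip into "internal" by default.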

3. Escalation rules that are additive, not restrictive. Instead of "don't do X," we write "when you encounter situation Y, stop and message the human before proceeding." Positive instructions hold better than negative ones.
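In code form, additive rules look like triggers, not prohibitions — each names a situation Y and fires a pause-and-notify. The rule names and thresholds here are hypothetical:

```python
# Each rule is (name, predicate over the action context). Firing any rule
# means: stop and message the human before proceeding. No rule says "don't".
ESCALATION_RULES = [
    ("payment_over_limit",  lambda ctx: ctx.get("amount", 0) > 500),
    ("new_recipient",       lambda ctx: ctx.get("new_recipient", False)),
]

def check_escalation(ctx):
    """Return the names of all rules that fire for this context."""
    return [name for name, fires in ESCALATION_RULES if fires(ctx)]
```

Adding a new situation is appending a rule, which is also why this scales better than editing a growing list of "never" clauses.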

4. Audit-first for anything irreversible. For the few things that can't be undone (external posts, payment actions), we log the intent before executing and give a brief window for cancellation. Five minutes is usually enough.
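The intent-then-window pattern can be sketched in a few lines (the injectable `sleep` and in-memory `AUDIT_LOG` are simplifications for illustration; in practice the log is append-only storage):

```python
import time

AUDIT_LOG = []  # stand-in for an append-only audit store

def execute_with_window(intent, do_action, is_cancelled,
                        window_s=300, poll_s=5, sleep=time.sleep):
    """Log intent first, hold for the cancellation window, then execute."""
    AUDIT_LOG.append(("INTENT", intent))      # recorded *before* anything happens
    waited = 0
    while waited < window_s:
        if is_cancelled(intent):              # human hit cancel during the window
            AUDIT_LOG.append(("CANCELLED", intent))
            return False
        sleep(poll_s)
        waited += poll_s
    do_action()
    AUDIT_LOG.append(("EXECUTED", intent))
    return True
```

Ordering is the whole trick: the intent line exists even if the process dies mid-window, so there is never an irreversible action without a prior record.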

The key shift: move controls from the language layer (prompts) to the infrastructure layer (permissions, keys, API access). Language is soft. Infrastructure is hard.