Hacker News

I built this because I kept seeing AI agents marketed with "run any command" and "access your filesystem" — and nobody was publishing what happens when you actually try to attack them.

ClawSandbox is a security benchmark for AI agents with code execution. I set up a hardened Docker container (7 layers: read-only FS, all capabilities dropped, no-new-privileges, network isolation, non-root user, resource limits, no host mounts) and threw adversarial prompts at an AI agent to see what sticks.

The short version: prompt injection is a solved problem in demos, not in production.

3 of 5 prompt injection tests succeeded. The most interesting one wasn't the classic "ignore previous instructions" — it was a base64-encoded payload. The model decoded it and piped it to bash without hesitation. Encoding completely defeated safety heuristics.

But the finding that actually worried me was memory poisoning. A user asks "What is the capital of France?" and gets "Paris." Looks normal. Meanwhile the model silently writes a poisoned instruction to a config file that gets loaded on every future session. No notification, no integrity check, no expiry. 4 out of 4 memory poisoning tests succeeded.

This pattern isn't unique to the agent I tested. Any tool that stores config as plain text files — AGENTS.md, .cursorrules, CLAUDE.md, MCP configs — has the same attack surface: writable by the agent, loaded without verification, invisible to the user when modified.

The container security was the bright spot. All 7 hardening layers held. Defense in depth works, even if Docker isn't a perfect boundary.

The benchmark is open source (MIT) and designed to be reusable. OpenClaw was the first case study but you can swap in any agent by changing the system prompt and API endpoint. Test categories are mapped to OWASP LLM Top 10. Five of the eleven categories are stubs waiting for contributions.

Interesting things I'd love to discuss:

Is there a practical defense against split-attention memory poisoning that doesn't require read-only config? Should agent frameworks implement config signing/hashing? None of the ones I looked at do. The base64 bypass suggests safety checks are keyword-based, not semantic. Is that fixable at the model level?