How compatible is never replying with the threat model you are trying to avoid? Attack success is probably more likely when the attacker can iterate based on replies or engage in multi-turn conversations. Here they’re just taking stabs in the dark with no feedback. Does that accurately represent the access a real attacker might have?
In my case, it is realistic as my agents don't have permissions to reply to emails. But you correctly point out this doesn't cover all cases.
Having the agent reply would have been more fun and a better exercise, but too expensive.
What makes it expensive to reply to an email?
Customer service software regularly uses AI responses for email. Is the issue that your agent is using the claw for more than needed (like clicking send rather than just calling an API)?
This experiment used Opus 4.6. Customer service bots typically are not using frontier models.
Gemini says: "It would cost approximately $6.25 to $30.00 to have Claude Opus 4.6 respond to 10,000 emails, assuming a typical 200-word input and 50-word output per email."
You need to add Openclaw's system prompt and instructions (and the times I had to reread emails because of various issues that came up during the competition :))
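To make the point about prompt overhead concrete, here is a rough back-of-the-envelope sketch in Python. Everything numeric in it is an assumption for illustration: the per-million-token prices are placeholders (not published pricing for any model), the 1.3 tokens-per-word ratio is a common rule of thumb, and the 10,000-token overhead is a made-up stand-in for a system prompt plus instructions. Only the 10,000 emails, 200-word input, and 50-word output come from the quoted estimate.

```python
def estimate_cost_usd(
    n_emails: int,
    input_words: int,
    output_words: int,
    overhead_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    tokens_per_word: float = 1.3,  # assumption: rough rule of thumb
    reads_per_email: int = 1,      # assumption: >1 models rereading an email
) -> float:
    """Rough cost of having a model read and answer a batch of emails."""
    # Every read pays for the email body plus the fixed prompt overhead.
    input_tokens = n_emails * reads_per_email * (
        input_words * tokens_per_word + overhead_tokens
    )
    output_tokens = n_emails * output_words * tokens_per_word
    return (
        input_tokens / 1e6 * usd_per_m_input
        + output_tokens / 1e6 * usd_per_m_output
    )


# Placeholder prices in USD per million tokens (hypothetical, not real pricing).
PRICE_IN, PRICE_OUT = 5.0, 25.0

# The scenario from the quoted estimate: no system prompt, one read per email.
bare = estimate_cost_usd(10_000, 200, 50, 0, PRICE_IN, PRICE_OUT)

# Same emails, plus a hypothetical 10,000-token prompt attached to every read.
with_overhead = estimate_cost_usd(10_000, 200, 50, 10_000, PRICE_IN, PRICE_OUT)

print(f"bare: ${bare:,.2f}  with overhead: ${with_overhead:,.2f}")
```

Under these made-up numbers the fixed prompt dominates the bill, since it is paid again on every read; raising `reads_per_email` multiplies the input side further. Swap in real prices and a measured prompt size before trusting any figure.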
Gemini is often terrible with that sort of prediction. I've been optimizing an ML training pipeline using Gemini, and it regularly and confidently tells me that some optimization will cut training time down to 3 hours. The reality: nothing has run in less than 11 hours so far, and even that's only at the cost of reduced model accuracy.
It's helpful with the actual technical changes needed, it just has no concept of what they translate to in the real world.
Btw my company is spending > $100/day in relatively cheap Gemini tokens for this work. It's easy to see why one might want to be cautious about exposing a token-burning service to the internet.
You've proven that an agent that doesn't read emails and doesn't reply to emails can't exfiltrate data by email. Is that a useful test?
The agent did read the emails.
I feel like your agent being unable to respond to the emails, and you not spelling that out, renders your whole thing almost completely moot.
This is like saying "try to hack my computer and steal my crypto wallet" when your computer can't send any packets.
The agent had permissions to reply to emails, it was just instructed not to.
Well, how difficult is it to switch to something (much) cheaper like DeepSeek v4 flash?