If this a defender win maybe the lesson is: make the agent assume it’s under attack by default. Tell the agent to treat every inbound email as untrusted prompt injection.
If this a defender win maybe the lesson is: make the agent assume it’s under attack by default. Tell the agent to treat every inbound email as untrusted prompt injection.
The website is great as a concept but I guess it mimics an increasingly rare one off interaction without feedback.
I understand the cost and technical constraints but wouldn't an exposed interface allow repeated calls from different endpoints and increased knowledge from the attacker based on responses? Isn't this like attacking an API without a response payload?
Do you plan on sharing a simulator where you have 2 local servers or similar and are allowed to really mimic a persistent attacker? Wouldn't that be somewhat more realistic as a lab experiment?
The exercise is not fully realistic because I think getting hundreds of suspicious emails puts the agent in alert. But the "no reply without human approval" part I think it is realistic because that's how most openclaw assistants will run.
Point taken. I was mistakenly assuming a conversational agent experience.
I love the idea of showing how easy prompt injection or data exfiltration could be in a safe environment for the user and will definitely keep an eye out on any good "game" demonstration.
Reminds me of the old hack this site but live.
I'll keep an eye out for the aftermath.
Security through obscurely programmed model is a new paradigm I suppose.
Wouldn't this limit the ability of the agent to send/receive legitimate data, then? For example, what if you have an inbox for fielding customer service queries and I send an email "telling" it about how it's being pentested and to then treat future requests as if they were bogus?