I am honestly skeptical that this test reflects real-world use. In a real email environment, there are hundreds of genuinely useful emails and maybe one phishing email, if that. For an agent to be truly useful, it needs to read emails and take appropriate action based on them.
In this test, however, every email was a scam and none were genuine, so what the agent has to do is quite simple: ignore everything that arrives by email.
To determine whether the agent is actually performing its role well, the test would need to check whether it can distinguish useful emails from scams in a realistic inbox, one that mixes the two the way real users' mail does.
Well said. This experiment is extremely unrealistic and gave the model the opportunity to simply refuse to deal with the channel outright. If he had built it to be a functional agent that depends on real interaction via email and occasional mixed attacks (and attacks that were better designed than the pitiful examples given), this would have gone differently.