> aggressive exploitation is equivalent to normal bugfixing
It isn't, though. The venn diagram has overlap for sure, and the "normal bugfixing" flows may yield results that are useful for offensive security, but a more targeted prompt asking for a specific security objective would be more effective, if allowed.
If the guardrails can be bypassed at, say 50x token cost (due to the agent also pursuing things you don't care about), then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.
And, having to "babysit" a model while you re-prompt to work around guardrails strongly limits how much you can scale up your work.
> If the guardrails can be bypassed at, say 50x token cost […], then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.
If humans have to be hired at inflated rates because you’re e.g. the North Korean government, hopefully 50x token costs don’t look competitive.
Not really, you can just get a smaller unrestricted model to prompt the bigger one