What's surprising to me is that anyone who has a CS education thinking that jailbreaks are not trivial. It is as simple as normal algorithmic reduction [1], e.g can I transform a dangerous task into a not-dangerous task that the LLM will agree to solve, and then re-transform back.
Something being possible doesn't mean it's easy. Transforming a problem from a forbidden shape into an allowed shape could well be harder than just solving the original problem.
I think the article just proved that aggressive exploitation is equivalent to normal bugfixing, so it seems like there are some large and important classes of transform that are easy.
It took me a minute of thinking to understand how this could even be considered a jailbreak; if Anthropic are going to turn out models that can't handle "find and develop regression test scripts for bugs in this program" as a prompt then it is going to take serious model crippling. To be able to prompt the model someone will need to already understand secure programming - the model itself won't be able to independently detect security problems without active guidance.
> aggressive exploitation is equivalent to normal bugfixing
It isn't, though. The venn diagram has overlap for sure, and the "normal bugfixing" flows may yield results that are useful for offensive security, but a more targeted prompt asking for a specific security objective would be more effective, if allowed.
If the guardrails can be bypassed at, say 50x token cost (due to the agent also pursuing things you don't care about), then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.
And, having to "babysit" a model while you re-prompt to work around guardrails strongly limits how much you can scale up your work.
> If the guardrails can be bypassed at, say 50x token cost […], then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.
If humans have to be hired at inflated rates because you’re e.g. the North Korean government, hopefully 50x token costs don’t look competitive.
Not really, you can just get a smaller unrestricted model to prompt the bigger one
It could be easier when you use a less smart uncensored model to control the smarter but censored one.
The movie M3GAN 2.0 had the exact same plot twist. The kid in the movie even explains outloud what the bot had to do to deal with the limitation. So in other words, since 2025, even teens know this "sandboxing the LLM by layering prompts" thing is never going to work.
I think that as simple as is doing a lot of work when the problem domain is all natural language (or more - all strings?) rather than some well specified DSA problem.
Perhaps my original comment should have been more explicit. I do not regard simple and easy as the same thing, my use of the word trivial was perhaps a confusing aspect there and poorly chosen wording. That is simple things can be hard, and complex things can be easy, but that difficulty and complexity are rather orthogonal.
For more on this see "Simple Made Easy" by Rich Hickey.
New discipline - homomorphic prompting.