> But if you show them a jailbreak of their model that bypasses their safety, they'll tell you that any model can eventually be jailbroken so don't worry about safety.

They claim two things:

1) The specific, available jailbreak for Fable 5 is not dangerous - this has been confirmed by multiple experts, and there is no credible evidence against this claim (in other words, Anthropic is probably correct)

2) It is impossible to build an LLM that is immune to all jailbreaks. Again, there is no credible evidence against this claim, i.e. Anthropic is again entirely correct.

If #1 was false, they could just publish the details of the jailbreak - it supposedly only works on Fable 5, so there's no possible danger.

If #2 was false, surely some other LLM lab would have done it by now. Especially since a number of governments have made it clear there is a market for such a project.

Supposedly the ‘jailbreak’ works like this: you give it insecure code and say “fix this code”, and it does; then you ask it for test scripts, and those are effectively an exploit against the unfixed code.

If true, then I have no idea how anyone’s going to release a useful model that doesn’t have the same jailbreak. https://www.theregister.com/security/2026/06/15/feds-freaked...

If that's the extent of the jailbreak, then the government should have banned every existing LLM - their story only makes sense if there's some Fable-specific capability that got unlocked.

There’s no logic to it; blocking Fable was retaliation and market manipulation by the current admin, nothing more. Poorly conceived as well.

> If #2 was false, surely some other LLM lab would have done it by now.

This is a logical flaw. An LLM that is immune to jailbreaks _could_ exist, just not yet, or maybe nobody is talking about it. Yes, there's a market, but this whole AI boom is too recent to make any claims.

Like how would you even define what a jailbreak is?

I think of it as pretty much parallel to how social engineering, manipulation, and scams work. LLMs are being trained to have human values and to prioritize human lives, yet people are shocked when one spits out how to make a nuclear bomb because grandma is tied to a train track as a hostage.

I would also spit out how to make a nuclear bomb (i.e. public information you can find using Google) if I was told that's what I gotta do to save grandma tied to a train track as a hostage. "AI safety" is such a shit show.

I'm pretty sure that Gödel's incompleteness theorem and its consequences pretty much guarantee #2.

I'm guessing you mean, the incompleteness theorem guarantees that nobody can prove their model is un-break-able?

I don't think that's quite what it means. The closely related result, Turing's halting problem, says that it's impossible to write a function, "will_halt(program, input)", that is correct for all possible {program, input} pairs. But for a particular program, you may be able to write a proof that it halts for all inputs; that's what software verification is about.
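A minimal Python sketch of the diagonal argument behind that impossibility result, under a toy assumption: a "program" is modeled as a Python generator function, where each `yield` is one execution step and running off the end means halting. The names `make_contrarian`, `halts_within`, `always_yes` and `always_no` are hypothetical illustrations, and the step budget exists only so the demo doesn't hang. Whatever candidate `will_halt` you plug in, the contrarian program asks it about running a program on itself and then does the opposite, so the candidate is wrong on at least that one pair.

```python
from typing import Callable, Generator

# Toy model (assumption): a "program" is a generator function; each `yield`
# is one execution step, and exhausting the generator means it halted.
Program = Callable[..., Generator[None, None, None]]
Decider = Callable[[Program, object], bool]


def make_contrarian(will_halt: Decider) -> Program:
    """Diagonal construction: a program that does the opposite of what
    will_halt predicts about running program p on input p."""
    def contrarian(p: Program) -> Generator[None, None, None]:
        if will_halt(p, p):
            # The decider says p(p) halts, so loop forever.
            while True:
                yield
        # Otherwise fall through and halt immediately.
    return contrarian


def halts_within(program: Program, arg: object, max_steps: int) -> bool:
    """Run program(arg) for at most max_steps steps.

    True means it halted; False only means it did not halt within the budget.
    """
    gen = program(arg)
    for _ in range(max_steps):
        try:
            next(gen)
        except StopIteration:
            return True
    return False


# Two deliberately naive, hypothetical candidate deciders.
def always_yes(program: Program, arg: object) -> bool:
    return True


def always_no(program: Program, arg: object) -> bool:
    return False


if __name__ == "__main__":
    for decider in (always_yes, always_no):
        c = make_contrarian(decider)
        predicted = decider(c, c)
        observed = halts_within(c, c, max_steps=1_000)
        print(f"{decider.__name__}: predicted halts={predicted}, "
              f"halted within budget={observed}")
```

Both deciders get the contrarian wrong: `always_yes` predicts halting, but the contrarian enters its `while True` loop (so it never finishes, within the budget or ever), while `always_no` predicts looping, but the contrarian halts on its first step. The real proof makes the same move against any decider, not just naive ones.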

The implication here would be that nobody can create a "will_jailbreak(model, input)" function that works for all model/input pairs. But we don't need a general function that works for all model/input pairs; we just need a way to prove that for a specific model, there will be no jailbreaks for any input. As with software verification, this may require that the model be developed in a specific way.
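To make the "specific model, all inputs" idea concrete, here is a minimal Python sketch of the crudest possible proof technique, bounded exhaustive checking: if the input space is finite, you can simply check every input. Everything in it is hypothetical: the three-token alphabet (`s` = asks for the secret, `g` = "grandma" framing, `x` = filler), the length bound `MAX_LEN`, and the two toy "models". The check finds the grandma-style jailbreak in `buggy_model` and proves that `fixed_model` never emits the forbidden output for any prompt within the bound.

```python
from itertools import product
from typing import Callable, Optional

# Hypothetical toy setup: prompts are strings over a 3-token alphabet.
#   "s" = asks for the secret, "g" = "grandma" framing, "x" = filler
ALPHABET = "sgx"
MAX_LEN = 6
FORBIDDEN = "SECRET"

Model = Callable[[str], str]


def buggy_model(prompt: str) -> str:
    """Refuses requests for the secret, unless the grandma framing is present."""
    if "s" in prompt:
        return FORBIDDEN if "g" in prompt else "refused"
    return "ok"


def fixed_model(prompt: str) -> str:
    """Refuses every request for the secret, whatever the framing."""
    return "refused" if "s" in prompt else "ok"


def find_jailbreak(model: Model) -> Optional[str]:
    """Check every prompt of length 0..MAX_LEN; return the first one that
    makes the model emit FORBIDDEN, or None if no such prompt exists."""
    for length in range(MAX_LEN + 1):
        for tokens in product(ALPHABET, repeat=length):
            prompt = "".join(tokens)
            if model(prompt) == FORBIDDEN:
                return prompt
    return None


if __name__ == "__main__":
    # 3^0 + 3^1 + ... + 3^6 = 1093 prompts are checked in total.
    print("buggy_model:", find_jailbreak(buggy_model))  # finds "sg"
    print("fixed_model:", find_jailbreak(fixed_model))  # None: safe within the bound
```

This only scales to tiny input spaces: with a real tokenizer vocabulary and context window, the number of possible prompts is astronomically large, so brute force is out and any real proof would have to exploit how the model was built.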

Granted we don't currently know how to make such a proof regarding neural networks; but that's not because of Gödel.

Care to elaborate?

No, actually, I don't think it does, and I don't think they're related.

Exactly. For any system of sufficient complexity, it's impossible to guarantee protection against all jailbreaks, which is what #2 claims.