Hacker News

handoflixue 14 hours ago

> But if you show them a jailbreak of their model that bypasses their safety, they'll tell you that any model can eventually be jailbroken so don't worry about safety.

They claim two things:

1) The specific, available jailbreak for Fable 5 is not dangerous - this has been confirmed by multiple experts, and there is no credible evidence against this claim (in other words, Anthropic is probably correct)

2) It is impossible to build an LLM that is immune to all jailbreaks. Again, there is no credible evidence against this claim, i.e. Anthropic is again entirely correct.

If #1 was false, they could just publish the details of the jailbreak - it supposedly only works on Fable 5, so there's no possible danger.

If #2 was false, surely some other LLM lab would have done it by now. Especially since a number of governments have made it clear there is a market for such a project.

mcintyre1994 14 hours ago

Supposedly the details of the ‘jailbreak’ are that you give it insecure code and say “fix this code”, and it does, and then you ask it for test scripts and that’s effectively an exploit against the unfixed code.

If true then I have no idea how anyone’s going to release a useful model that doesn’t have the same jailbreak. https://www.theregister.com/security/2026/06/15/feds-freaked...

handoflixue 12 hours ago

If that's the extent of the jailbreak, then the government should have banned every existing LLM - their story only makes sense if there's some Fable-specific capability that got unlocked.

le-mark 9 hours ago

There’s no logic to it; blocking Fable was retaliation and market manipulation by the current admin, nothing more. Poorly conceived as well.

Charon77 12 hours ago

> If #2 was false, surely some other LLM lab would have done it by now.

This is a logical flaw. An LLM that is immune to jailbreaks _could_ exist, just not yet, or maybe nobody is talking about it. Yes, there's a market, but this whole AI boom is too recent to make any such claims.

gf000 12 hours ago

Like how would you even define what a jailbreak is?

Charon77 11 hours ago

I think it's pretty much parallel to how social engineering, manipulation, and scams work. LLMs are being trained to have human values, prioritizing human lives, yet people are shocked that one will spit out how to make a nuclear bomb because grandma is tied to a train track as a hostage.

NavinF 2 hours ago

I would also spit out how to make a nuclear bomb (i.e. public information you can find using Google) if I were told that's what I had to do to save grandma tied to a train track as a hostage. "AI safety" is such a shit show.

agos 13 hours ago

I'm pretty sure that Gödel's incompleteness theorem and its consequences pretty much guarantee #2

gwd 12 hours ago

I'm guessing you mean, the incompleteness theorem guarantees that nobody can prove their model is un-break-able?

I don't think that's quite what it means. The relevant result here is really the closely related halting problem, which says that it's impossible to write a function, "will_halt(program, input)", that will be correct for all possible {program, input} pairs. But for a particular program, you may be able to write a proof that it will halt for all inputs -- that's what software verification is about.
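
To make that distinction concrete, here is a minimal sketch of my own (all the names, `countdown`, `make_contrarian` and `always_no`, are hypothetical illustrations, not anything from the thread or a library). The first function is a specific program we can argue halts for every input; the second shows how any candidate "will_halt" can be turned against itself. For simplicity the candidate takes a zero-argument program rather than a {program, input} pair.

```python
from typing import Callable


def countdown(n: int) -> int:
    """A specific program that halts for every integer input.

    Termination argument: `remaining` is a non-negative integer that
    strictly decreases on each iteration, so the loop cannot run forever.
    """
    remaining = max(n, 0)
    steps = 0
    while remaining > 0:
        remaining -= 1
        steps += 1
    return steps


def make_contrarian(will_halt: Callable[[Callable[[], object]], bool]) -> Callable[[], str]:
    """Build a program that does the opposite of what `will_halt` predicts for it."""

    def contrarian() -> str:
        if will_halt(contrarian):
            # The candidate said "halts", so loop forever to prove it wrong.
            while True:
                pass
        # The candidate said "never halts", so halt immediately.
        return "halted"

    return contrarian


def always_no(program: Callable[[], object]) -> bool:
    """A deliberately naive candidate decider: claims nothing ever halts."""
    return False


contrarian = make_contrarian(always_no)
result = contrarian()  # returns, even though always_no predicted it would not
```

Only the "always says no" candidate is actually run here, because a candidate that answered "halts" would send `contrarian` into its infinite loop. Either way the candidate is wrong about that one program, which is the general impossibility, while `countdown` shows that a per-program termination argument is still perfectly possible.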

The implications here would be that nobody can create a "will_jailbreak(model, input)" function which works for all model/input pairs. But we don't need a general function which works for all model/input pairs; we just need a way to prove that for a specific model, there will be no jailbreaks for any input. As with software verification, this may require that the model be developed in a specific way.

Granted we don't currently know how to make such a proof regarding neural networks; but that's not because of Gödel.

dgellow 12 hours ago

Mind elaborating?

Zababa 10 hours ago

No actually I don't think it does and I don't think they're related.

monkey_monkey 12 hours ago

Exactly. For any system of sufficient complexity, it's impossible to guarantee protection against all jailbreaks, which is what #2 says.