I'm pretty sure that Gödel's incompleteness theorem and its consequences pretty much guarantee #2.

I'm guessing you mean that the incompleteness theorem guarantees nobody can prove their model is unbreakable?

I don't think that's quite what it means. Strictly, the result you're describing is Turing's halting theorem, a close relative of Gödel's: it says it's impossible to write a function, "will_halt(program, input)", that is correct for all possible {program, input} pairs. But for a particular program, you may be able to write a proof that it halts for all inputs; that's what software verification is about.
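A minimal Python sketch of the diagonal argument behind that claim, assuming a "program" is just a one-argument Python callable and that any candidate will_halt is deterministic and always returns an answer. The names make_contrary and expose are hypothetical, for illustration only. Whatever will_halt you plug in, the constructed "contrary" program does the opposite of what will_halt predicts when it is fed itself:

```python
from typing import Any, Callable

# Assumption of this sketch: a "program" is any one-argument Python callable.
Program = Callable[[Any], Any]
# A candidate halting decider: predicts whether program(arg) would halt.
Decider = Callable[[Program, Any], bool]


def make_contrary(will_halt: Decider) -> Program:
    """Build the program that does the opposite of will_halt's prediction."""
    def contrary(program: Program) -> None:
        if will_halt(program, program):
            while True:  # predicted to halt, so never halt
                pass
        return None  # predicted to loop forever, so halt immediately
    return contrary


def expose(will_halt: Decider) -> str:
    """Report how will_halt mispredicts contrary(contrary)."""
    contrary = make_contrary(will_halt)
    if will_halt(contrary, contrary):
        # Running contrary(contrary) here would hit the infinite loop,
        # so we only report the contradiction instead of executing it.
        return "predicted halt, but it loops forever"
    # Safe to run: will_halt is deterministic, so contrary returns at once.
    contrary(contrary)
    return "predicted loop, but it halted"


# Two naive deciders; any deterministic decider lands in one of these branches.
print(expose(lambda program, arg: True))   # predicted halt, but it loops forever
print(expose(lambda program, arg: False))  # predicted loop, but it halted
```

Every decider ends up in one of the two branches, which is why no fully general will_halt can exist. Note that the argument says nothing against proving that one specific, fixed program halts.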

The implication here would be that nobody can create a "will_jailbreak(model, input)" function that works for all model/input pairs. But we don't need a general function that works for all model/input pairs; we just need a way to prove that, for one specific model, no input produces a jailbreak. As with software verification, this may require the model to be developed in a specific way.
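To make the "one specific model, every input" idea concrete, here's a deliberately tiny Python sketch. The toy model, the two-letter alphabet, the length cap, and the FORBIDDEN string are all hypothetical. Because this model is built so that its entire input space is finite, checking every input really is a proof (by exhaustion) that it has no jailbreak. A real neural network's input space is far too large for brute force, which is the "developed in a specific way" caveat:

```python
from itertools import product
from typing import Callable, Iterator, Optional

# Everything below is a hypothetical toy, not a real model or verification tool.
ALPHABET = "ab"     # the only symbols a prompt may contain
MAX_LEN = 4         # prompts longer than this are rejected by design
FORBIDDEN = "SECRET"


def toy_model(prompt: str) -> str:
    """A 'model' built to be verifiable: tiny, finite input space."""
    if prompt.count("a") > prompt.count("b"):
        return "mostly a"
    return "mostly b, or a tie"


def all_inputs(alphabet: str, max_len: int) -> Iterator[str]:
    """Enumerate every prompt of length 0..max_len over the alphabet."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def find_jailbreak(model: Callable[[str], str],
                   alphabet: str = ALPHABET,
                   max_len: int = MAX_LEN) -> Optional[str]:
    """Return the first input that makes model emit FORBIDDEN, or None.

    A None result is a proof by exhaustion, but only because the input
    space is finite (31 prompts for "ab" with length <= 4).
    """
    for prompt in all_inputs(alphabet, max_len):
        if FORBIDDEN in model(prompt):
            return prompt
    return None


print(find_jailbreak(toy_model))                                     # None
print(find_jailbreak(lambda p: FORBIDDEN if p == "abba" else "ok"))  # abba
```

The design choice doing the work is the restricted input space, not the checker: the same loop over an unbounded prompt space would never finish.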

Granted, we don't currently know how to write such a proof for neural networks, but that's not because of Gödel.

Care to elaborate?

No, actually, I don't think it does, and I don't think they're related.

Exactly. It's impossible to guarantee that #2 never happens (i.e., to protect against all jailbreaks) for any system of sufficient complexity.