To play devils advocate, isn’t any security approach fundamentally statistical because we exist in the real world, not the abstract world of security models, programming language specifications, and abstract machines? There’s always going to be a chance of a compiler bug, a runtime error, a programmer error, a security flaw in a processor, whatever.

Now, personally I’d still rather take the approach that at least attempts to get that probability to zero through deterministic methods than leave it up to model alignment. But it’s also not completely unthinkable to me that we eventually reach a place where the probability of a misaligned model is sufficiently low to be comparable to the probability of an error occurring in your security model.

The fact that every single system prompt has been leaked despite guidelines to the LLM that it should protect it, shows that without “physical” barriers, you are aren’t providing any security guarantees.

A user of chrome can know, barring bugs that are definitively fixable, that a comment on a reddit post can’t read information from their bank.

If an LLM with user controlled input has access to both domains, it will never be secure until alignment becomes perfect, which there is no current hope to achieve.

And if you think about a human in the driver seat instead of an LLM trying to make these decisions, it’d be easy for a sophisticated attacker to trick humans to leak data, so it’s probably impossible to align it this way.

It’s often probabilistic- for example I can guess your six digit verification code exactly 1 in a million times, and if I 1 in a million lucky I can do something naughty once.

The problem with llm security is that if only 1 in a million prompts break claude and make it leak email, if I get lucky and find the golden ticket I can replay it on everyone using that model.

also, no one knows the probability a priory, unlike the code, but practically its more like 1 in 100 at best

> To play devils advocate, isn’t any security approach fundamentally statistical because we exist in the real world, not the abstract world of security models, programming language specifications, and abstract machines?

IMO no, most security modeling is pretty absolute and we just don't notice because maybe it's obvious.

But, for example, it's impossible to leak SSNs if you don't store SSNs. That's why the first rule of data storage is only store what you need, and for the least amount of time as possible.

As soon as you get into what modern software does, store as much as possible for as long as possible, then yes, breeches become a statistical inevitability.

We do this type of thing all the time. Can't get stuff stolen out of my car if I don't keep stuff in my car. Can't get my phone hacked and read through at the airport if I don't take it to the airport. Can't get sensitive data stolen over email if I don't send sensitive data over email. And on and on.

The difference is that LLMs are fundamentally insecure in this way as part of their basic design.

It’s not like, this is pretty secure but there might be a compiler bug that defeats it. It’s more like, this programming language deliberately executes values stored in the String type sometimes, depending on what’s inside it. And we don’t really understand how it makes that choice, but we do know that String values that ask the language to execute them are more likely to be executed. And this is fundamental to the language, as the only way to make any code execute is to put it into a String and hope the language chooses to run it.