> Instead they believe model alignment, trying to understand when a user is doing a dangerous task, etc. will be enough.
Maybe I have a fundamental misunderstanding, but I feel like hoping that model alignment and in-model guardrails are statistical preventions, ie you'll reduce the odds to some number of zeroes preceeding the 1. These things should literally never be able to happen, though. It's a fools errand to hope that you'll get to a model where there is no value in the input space that maps to <bad thing you really don't want>. Even if you "stack" models, having a safety-check model act on the output of your larger model, you're still just multiplying odds.
It's a common mistake to apply probabilistic assumptions to attacker input.
The only [citation needed] correct way to use probability in security is when you get randomness from a CSPRNG. Then you can assume you have input conforming to a probability distribution. If your input is chosen by the person trying to break your system, you must assume it's a worst-case input and secure accordingly.
The sortof fun thing is that this happens with human safety teams too. The Swiss Cheese model is generally used to understand how the failures can line up to cause disaster to punch right through the guardrails:
https://medium.com/backchannel/how-technology-led-a-hospital...
It's better to close the hole entirely by making dangerous actions actually impossible, but often (even with computers) there's some wiggle room. For example, if we reduce the agent's permissions, then we haven't eliminated the possibility of those permissions being exploited, merely required some sort of privilege escalation to remove the block. If we give the agent an approved list of actions, then we may still have the possibility of unintended and unsafe interactions between those actions, or some way an attacker could add an unsafe action to the list. And so on, and so forth.
In the case of an AI model, just like with humans, the security model really should not assume that the model will not "make mistakes." It has a random number generator built right in. It will, just like the user, occasionally do dumb things, misunderstand policies, and break rules. Those risks have to be factored in if one is to use the things at all.
Humans are dramatically stronger than LLMs. An LLM is like a human you can memory wipe and try to phish hundreds of times a second until you find a script that works. I agree with what you're saying, but it's important to frame an LLM is not like a security guard who will occasionally let a former employee in because they recognize them. They can be attacked pretty relentlessly and once they're open they're wide open.
Thank you for that link, that was a great read.
To play devils advocate, isn’t any security approach fundamentally statistical because we exist in the real world, not the abstract world of security models, programming language specifications, and abstract machines? There’s always going to be a chance of a compiler bug, a runtime error, a programmer error, a security flaw in a processor, whatever.
Now, personally I’d still rather take the approach that at least attempts to get that probability to zero through deterministic methods than leave it up to model alignment. But it’s also not completely unthinkable to me that we eventually reach a place where the probability of a misaligned model is sufficiently low to be comparable to the probability of an error occurring in your security model.
The fact that every single system prompt has been leaked despite guidelines to the LLM that it should protect it, shows that without “physical” barriers, you are aren’t providing any security guarantees.
A user of chrome can know, barring bugs that are definitively fixable, that a comment on a reddit post can’t read information from their bank.
If an LLM with user controlled input has access to both domains, it will never be secure until alignment becomes perfect, which there is no current hope to achieve.
And if you think about a human in the driver seat instead of an LLM trying to make these decisions, it’d be easy for a sophisticated attacker to trick humans to leak data, so it’s probably impossible to align it this way.
It’s often probabilistic- for example I can guess your six digit verification code exactly 1 in a million times, and if I 1 in a million lucky I can do something naughty once.
The problem with llm security is that if only 1 in a million prompts break claude and make it leak email, if I get lucky and find the golden ticket I can replay it on everyone using that model.
also, no one knows the probability a priory, unlike the code, but practically its more like 1 in 100 at best
> To play devils advocate, isn’t any security approach fundamentally statistical because we exist in the real world, not the abstract world of security models, programming language specifications, and abstract machines?
IMO no, most security modeling is pretty absolute and we just don't notice because maybe it's obvious.
But, for example, it's impossible to leak SSNs if you don't store SSNs. That's why the first rule of data storage is only store what you need, and for the least amount of time as possible.
As soon as you get into what modern software does, store as much as possible for as long as possible, then yes, breeches become a statistical inevitability.
We do this type of thing all the time. Can't get stuff stolen out of my car if I don't keep stuff in my car. Can't get my phone hacked and read through at the airport if I don't take it to the airport. Can't get sensitive data stolen over email if I don't send sensitive data over email. And on and on.
The difference is that LLMs are fundamentally insecure in this way as part of their basic design.
It’s not like, this is pretty secure but there might be a compiler bug that defeats it. It’s more like, this programming language deliberately executes values stored in the String type sometimes, depending on what’s inside it. And we don’t really understand how it makes that choice, but we do know that String values that ask the language to execute them are more likely to be executed. And this is fundamental to the language, as the only way to make any code execute is to put it into a String and hope the language chooses to run it.
All modern computer security is based on trying to improbabilities. Public key cryptography, hashing, tokens, etc are all based on being extremely improbable to guess, but not impossible. If an LLM can eventually reach that threshold, it will be good enough.
Cryptography's risk profile is modeled against active adversaries. The way probability is being thrown around here is not like that. If you find 1 in a billion in the full training set of data that triggers this behavior, that's not the same as 1 in a billion against an active adversary. In cryptography there are vulnerabilities other than brute force.
"These things should literally never be able to happen"
If we consider "humans using a bank website" and apply the same standard, then we'd never have online banking at all. People have brain farts. You should ask yourself if the failure rate is useful, not if it meets a made up perfection that we don't even have with manual human actions.
Just because humans are imperfect and fall for scams and phishing doesn't mean we should knowingly build in additional attack mechanisms. That's insane. Its a false dilemma.
Go hire some rando off the street, sit them down in front of your computer, and ask them to research some question for you while logged into your user account and authenticated to whatever web sites you happen to be authenticated to.
Does this sound like an absolutely idiotic idea that you’d never even consider? It sure does to me.
Yes, humans also aren’t very secure, which is why nobody with any sense would even consider doing this either a human.
The vast majority of humans would fall to bad security.
I think we should continue experimenting with LLMs and AI. Evolution is littered with the corpses of failed experiments. It would be a shame if we stopped innovating and froze things with the status quo because we were afraid of a few isolated accidents.
We should encourage people that don't understand the risks not to use browsers like this. For those that do understand, they should not use financial tools with these browsers.
Caveat emptor.
Don't stall progress because "eww, AI". Humans are just as gross.
We need to make mistakes to grow.
We can continue to experiment while also going slowly. Evolution happens over many millions of years, giving organisms a chance to adapt and find a new niche to occupy. Full-steam-ahead is a terrible way to approach "progress".
> while also going slowly
That's what risk-averse players do. Sometimes it pays off, sometimes it's how you get out-innovated.
If the only danger is the company itself bankrupt, then please, take all the risks you like.
But if they're managing customer-funds or selling fluffy asbestos teddybears, then that's a problem. It's a profoundly different moral landscape when the people choosing the risks (and grabbing any rewards) aren't the people bearing the danger.
You can have this outrage when your parents are using browser user agents.
All of this concern is over a hypothetical Reddit comment about a technology used by early adopter technologists.
Nobody has been harmed.
We need to keep building this stuff, not dog piling on hate and fear. It's too early to regulate and tie down. People need to be doing stupid stuff like ordering pizza. That's exactly where we are in the tech tree.
"We need to keep building this stuff" Yeah, we really don't. As in there is literally no possible upside for society at large to continuing down this path.
Well if we eliminate greed and capitalism then maybe at some point we can reach a Star Trek utopia where nobody has to work because we eliminate scarcity.
... Either that or the wealthy just hoard their money-printers and reject the laborers because they no longer need us to make money so society gets split into 99% living in feudal squalor and 1% living as Gods. Like in Jupiter Ascending. Man what a shit movie that was.
We basically eliminated scarcity a few generations ago and yet here we are.
This AI browser agent is outright dangerous as it is now. Nobody has been attacked this way... that we know of... yet.
It's one thing to build something dangerous because you just don't know about it yet. It's quite another to build something dangerous knowing that it's dangerous and just shrugging it off.
Imagine if Bitcoin was directly tied to your bank account and the protocol inherently allowed other people to perform transactions on your wallet. That's what this is, not "ordering pizza."
When your “mistakes” are “a user has their bank account drained irrecoverably”, no, we don’t.
So let's stop building browser agents?
This is a hypothetical Reddit comment that got Tweeted for attention. The to-date blast radius of this is zero.
What you're looking at now is the appropriate level of concern.
Let people build the hacky pizza ordering automations so we can find the utility sweet spots and then engineer more robust systems.
1. Access to untrusted data.
2. Access to private data.
3. Ability to communicate externally.
An LLM must not be given all three of these, or it is inherently insecure. Any two is fine (mostly, private data and external communication is still a bit iffy), but if you give them all three then you're screwed. This is inherent to how LLMs work, you can't fix it as the technology stands today.
This isn't a secret. It's well known, and it's also something you can easily derive from first principles if you know the basics of how LLMs work.
You can build browser agents, but you can't give them all three of these things. Since a browser agent inherently accesses untrusted data and communicates externally, that means that it must not be given access to private data. Run it in a separate session with no cookies or other local data from your main session and you're fine. But running it in the user's session with all of their state is just plain irresponsible.
The CEO of Perplexity hasn't addressed this at all, and instead spent all day tweeting about the transitions in their apps. They haven't shown any sign of taking this seriously and this exploit has been known for more than a month: https://x.com/AravSrinivas/status/1959689988989464889
> So let's stop building browser agents?
Yes, because the idea is stupid and also the reality turns out to be stupid. No part of this was not 100% predictable.
> So let's stop building browser agents?
The phrasing of this seems to imply that you think this is obviously ridiculous to the point that you can just say it ironically. But I actually think that's a good idea.