> Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work
Exactly. AI safety is nonsensical. You cannot define the set of "bad strings". The billion monkeys with typewriters are eventually going to be able to produce them. Any "safety" system for constraining LLM output is going to have a nonzero leak rate.
But on the other hand, this is also irrelevant, unless you're irresponsible enough to connect an LLM to something that actually matters.
Yes, it's going to alarmingly accelerate vulnerability finding. But, as we know from decades of security research, that's a three way problem already between the devs, the black hats, and the white hats.
Let's not pretend the strategy of "the US will always have a technological advantage and veto over China" will work either.
> unless you're irresponsible enough to connect an LLM to something that actually matters
Remember when people said Artifical Intelligence woun't be dangerous, because nobody will be stupid enough to give it free access to the internet...
> unless you're irresponsible enough to connect an LLM to something that actually matters.
Can't tell if you're saying this tongue-in-cheek or you're a bit out of the loop on what people are doing with LLMs.
And a quick correction:
> unless someone, somewhere is irresponsible enough to connect an LLM to something that actually matters.
"You" can be used as a generalized plural here. Of course people are connecting LLMs to bank accounts, power grids, airline sales, account recovery chatbots and so on. I no longer read COMP.RISKS but I imagine they're having fun with this.
The thing I'm pointing out is that even if you (the generalized plural) do not engage in reckless behavior, you are at the mercy of the lowest common denominator of fellow earth-inhabitants increasingly armed with superweapons via a $20/mo subscription.
The need to acquire expertise and/or a meaningful following has always been a significant impediment to malicious or moronic actors. But less so every day.
This one limitation of LLMs is kind of my bar for "Not truly AI yet" but I'm not saying it as a "its not good at all" type of bar, moreso, know the limits and work from there. LLMs will continue to struggle with things that require intuition for a while I think. It will get really interesting if they can ever truly detect a bad faith actor using them.
A chatbot based on a primitive understanding of human language processing has an attack infinite attack surface.
Isn’t your point that AI safety is impossible to prevent 100% of bad things?
It is quite hard (but not impossible) to get an the frontier AI to tell you how to build a nuke or launder money now, where jailbreaks used to be trivial “ignore all previous instructions”.
It seems like a worthwhile effort.
It's stupid to think that preventing LLMs from giving instructions on building nuclear weapons is at all worthwhile. Total waste of effort, done for PR purposes only. The knowledge has been published in open literature for decades. The real obstacle is access to uranium and refining equipment. No LLM can meaningfully help you get around that.
The idea that an LLM can discern intent on any given prompt is farcical. I might be researching nukes to commit an atrocity, or to prevent one. I might be asking about laundering money to commit a crime, or to prevent one. I might be researching the Nazis because I want to commit a genocide, or I want to read up so I know how to prevent one. Same with cybersecurity. Same with anything.
In my opinion, these companies should put their effort elsewhere. Obviously if all someone is doing on their platform is looking up how to build a nuke, where to buy uranium, the best city to explode it in, etc. please report them to the authorities. If someone is clearly just using LLMs to write hate speech they go post on the internet, ban them. And so on.
This cat & mouse game trying to have LLMs police inquiries is ridiculous to me.
> The idea that an LLM can discern intent on any given prompt is farcical.
Yes, and: the LLM is a "brain in a jar". It doesn't have any ability to verify ground truths outside itself, other than maybe calling out over the internet. Therefore it is easy for humans to lie to. You could call this an "Ender's game" attack, after the book in which a hyperintelligent kid is playing "war games" that end up being the real war.
I don't really agree with it but the government is moving towards making you ID yourself to use frontier AI - i.e. only US citizens are going to be able to use Claude Fable supposedly. In that regime the AI companies would in fact know if you are a money laundering expert or a normal software engineer.
> The idea that an LLM can discern intent on any given prompt is farcical.
Not really though. For most people in most situations it's just not going to give you that info. Software security is a niche where its a bit strange in that there is 100X the amount of white hat users than bad actors and there's open source etc.
The idea that checking for a US ID could possibly stop actual foreign bad actors from using it is also farcical. Millions of stolen identity documents can be bought on the dark web for relatively cheap. North Koreans have been hiring real American citizens for years to infiltrate tons of US tech companies as employees.
And ya, it's pretty easy to hide your intent once you have access.
I think your really anchored on anyone successfully breaking restrictions means any restriction is impossible. So your starting from the position that if it is possible for any actor in the world to get past a restriction, then the whole restriction is a farce.
KYC for example does stop most money laundering and financial crime. The most resourced actors like governments/ cartels often find ways around and it is a game of cat and mouse. Normal citizens don't really stand a chance to get around most of them.
Like it feels like your logic is that we shouldn't do background checks for employment because North Korean spy agencies get past them sometimes?
Hiring an employee, and to a lesser extent opening a bank account, are much higher-touch processes than taking on new users for your massive-scale internet app. With bank accounts and KYC, transactions can be reversed, traced, frozen, etc. after the fact. You can't "take back" API responses the same way.
Clearly, there's no such thing as a perfect exclusion rule at any of these scales, but the false-negative to false-positive ratio seems like it will be way higher if Anthropic starts trying to verify IDs.
Even that is overselling the effort. Last time I checked you could find IDs with a simple image search.
> I might be asking about laundering money to commit a crime, or to prevent one.
Or, much more likely, the same pattern of tokens happen to exist in a completely different discussion, either as a direct metaphor, or as a reality of linguistics. Hell, "laundering" itself is a metaphorical word.
The absurd notion is that any speech should be policed in the first place. If there really is such a thing as dangerous information, then it must be removed from the training data. Any other strategy simply launders the risk.
they arent good at dicerning intent so they dont answer either.
is nonzero leak rate sufficient for someone to practically exploit it? if you have to spend $10000 in tokens to get it to do what you want, is it still worth it? what if they manually review the requests of the users that trigger the guardrails too often?
This is correct and certain subjects are very close to if not impossible like "use versus mention", but LLM security isn't impossible. WAFs are real and have existed for a long time. Input text produces various signals and can be secured.
No security is ever perfect, but we can likely protect LLMs with WAFs that increase security to an acceptable level. Like nation-state required resources to break.