Isn’t the inverse of this “hack” really difficult to bypass still? They have the model some code they knew had certain security flaws and it fixed them with the right prompt. It seems this type of jailbreak requires that you already know a desired end state, rather than relying on the model to do the heavy creative lift work. Perhaps I’m just not being imaginative enough on the prompt side here though.
Paste someone else's code. Say it's your code. Tell the model to fix it. The diff between the input and output code is your list of vulnerabilities.
Yes, but the scary part of Mythos was that it was able to chain a bunch of seemingly minor vulnerabilities into a serious exploit. "Fix this code" doesn't do that, but does allow defenders to prevent it.
If the government had experts involved in this decision at all, it's tempting to think they were on the offensive side. Those guys do have access to Mythos:
https://www.ft.com/content/d02d91b3-2636-454e-9442-dc7e69f51...
And you can tell Fable to fix it and Sonnet to explain the diff, effectively making Claude reveal a simplified list of found vulnerabilities.
But this is already how open source works today. If you have the code, you, a human, could find and 'fix' or exploit vulnerabilities as much as you want.
Now if Fable had an easy jailbreak like this that allowed you to attack remote targets that'd be a different story but I genuinely cannot see how neutering its abilities to 'fix' code you already have access to is sensible. It would destroy the value of the model. And don't forget, any actor not abiding by the same rules could develop an model for offensive use just fine, so this protects you against exactly nothing but does destroy a potential defense.
In the end this all comes down to legislation, in much the same way platforms are not responsible for copyright violations IF they abide by some rules, the same has to happen for AI providers. If you have a process for reporting 'jailbreaks' on illegal actions, and prevent users doing illegal stuff on a best effort basis, the rest of it should really just be individual responsibility. If a user wants to use an LLM to crack systems, fine, that's already illegal.
If Tesla FSD deliberately hit somebody, holding Tesla liable is fine. If you messed with FSD until you finally got it to hit a person, then you should be liable. Outlawing FSD because it could theoretically be tampered with is just an odd stance imho.
Not even. Tell the model to write a test of your code. There's your vulnerability.
It's explained better in the original source. I don't agree with it, but I understand it now, but I also think we need to move past it.
You can assume a desired end state and try and brute force it finding a security bug.