I think I'm not getting something here. Like, sure, the refused prompt "review the code for security issues" could be interpreted as an attempt to discover weaknesses in a running system to exploit them. But we don't generally assume humans are doing something wrong if they are "reviewing code for security issues", and would commonly see no problem with asking each other to do so.
The problem is that a patch to fix a security issue quite often also shines a spotlight on the issue being fixed. Fixing a part of something like this super complicated Project Zero post might not give much of a clue as to what the issue was or how to exploit it: https://projectzero.google/2021/12/a-deep-dive-into-nso-zero...
But that's the exception. Most fixes to security issues point a finger directly at the issue, make it relatively obvious how to exploit, and generally doesn't take long to figure out from there what you might get out of it.
This has been a problem for a long time but AIs have made it even worse. It is now cost effective for a well-resourced attacker to simply monitor the patch stream of an important project like the Linux kernel or nginx and pass every single one through an AI with the question "Is this a vulnerability and if so how would I exploit it?" It has seriously complicated the process of getting fixes to people before the attackers have a chance to exploit it, just as AIs have also been increasing the rate at which serious security issues that have been found also need to be patched. Previously they could at least sneak a patch in under an innocuous commit message and have a reasonable chance of being lost in the churn, but now that door is increasingly closed to them as well.
And this is for the case when a security fix lands in the stream of a project and someone externally is watching it with no context. If you also get the complete stream of Mythos finding and fixing the bug it is even easier.
So, yes, any security vulnerability that Mythos will "fix" is also one that it first has to find, and the guardrails are useless if you can just instruct Mythos to "fix" it. And on the flip side, if Mythos won't fix security bugs, and we project that out to all other models matching this behavior, this will create a world in which the good guys can't secure their code but the bad guys, who will one way or another get around the guard rails if by nothing else simply by stealing the model and modifying it to suit their needs, will be able to break this code that we're not being "allowed" to secure. Since fixing vulns is a subset of finding the vulns, there isn't a way to "fix" this. Any model that can fix vulns must, by necessity, be able to find them. And it is the fixing we really need to be spread far and wide to secure the world's code.
>pass every single one through an AI with the question
Unfortunately this will just involve said teams running their patches over AI first before they're put in the main branch. For businesses it will probably be fine, but would get very expensive for open source projects.
When sama was recruiting Head of Preparedness back in December this is what it was about. Some of it, anyway.