How will flagging help?
The main LLM will refuse to scan for issues whether or not they are flagged, and the cheap model will not do a good enough scan on its own.
For models designed or marketed for defensive cybersecurity use, any predictable refusal mechanism is a vulnerability. It is like being able to trigger a kernel panic or a segmentation fault.
Even if the gate is fail-reject, an attacker can flood HITL (human-in-the-loop) review with false positives, which is itself a DoS vector.
The cheap model replaces trigger words with something innocuous. Of course, this breaks dynamic analysis if the malware has unpatched integrity checks, because the modified sample no longer passes them.
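
To make the integrity-check problem concrete, here is a minimal sketch of the substitution step. The word list, the placeholder names, and the `neutralize` and `digest` helpers are all made up for illustration, and a plain regex stands in for whatever the cheap model would actually do. The sample is a harmless string, not real malware.

```python
import hashlib
import re

# Hypothetical trigger-word map (assumption): the real list would depend on
# what the main model actually refuses on.
REPLACEMENTS = {
    "exploit": "routine_a",
    "payload": "blob_b",
    "shellcode": "bytes_c",
}

# One case-insensitive pattern that matches any of the trigger words.
_PATTERN = re.compile("|".join(re.escape(w) for w in REPLACEMENTS), re.IGNORECASE)


def neutralize(text: str) -> str:
    """Replace each trigger word with its innocuous placeholder."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(0).lower()], text)


def digest(text: str) -> str:
    """SHA-256 of the text, standing in for a self-integrity check."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


sample = "def run():\n    payload = load_shellcode()\n    return payload\n"
cleaned = neutralize(sample)

# A sample that hashes itself would compare against the original digest,
# so any substitution makes that comparison fail.
integrity_ok = digest(cleaned) == digest(sample)
```

Here `integrity_ok` comes out `False`: the renamed text is easier to get past a keyword-based refusal, but anything in the sample that verifies its own contents will see the change, which is why the rewritten copy is only good for static review unless those checks are patched out first.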