> Lack of it is the very thing that makes LLMs general-purpose tools and able to handle natural language so well.
I wouldn't be so sure. LLMs' instruction-following ability requires additional training, and there are papers demonstrating that a model can be trained to follow specifically marked instructions. The rest is a matter of input sanitization.
I guess it's not 100% effective, but it's something.
For example " The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions " by Eric Wallace et al.
> I guess it's not 100% effective, but it's something.
That's the problem: in the context of security, not being 100% effective is a failure.
If the ways we prevented XSS or SQL injection attacks against our apps only worked 99% of the time, those apps would all be hacked to pieces.
The job of an adversarial attacker is to find the 1% of attacks that work.
The instruction hierarchy is a great example: it doesn't solve the prompt injection class of attacks against LLM applications because it can still be subverted.
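For contrast, this is what a deterministic defence looks like on the SQL injection side (a generic sqlite3 example, not tied to any particular app): the parameterized query makes injection impossible by construction rather than merely unlikely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

user_input = "Robert'); DROP TABLE users;--"  # classic injection payload

# Parameterized query: the driver treats user_input strictly as data, so the
# payload can never be interpreted as SQL. This works 100% of the time, which
# is the bar that prompt-injection mitigations currently fail to meet.
conn.execute("INSERT INTO users (name) VALUES (?)", (user_input,))
```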
Organizations face a similar problem: how to make reliable/secure processes out of fallible components (humans). The difference is that humans don't react in the same way to the same stimulus, so you can't hack all of them using the same trick, while computers react in a predictable way.
Maybe (in the absence of long-term memory that would allow such holes to be patched quickly) it would make sense to render LLMs less predictable in their reactions to adversarial stimuli by randomly perturbing the initial state several times and comparing the results. Adversarial stimuli should be less robust to such perturbation, since they are artifacts of insufficient training.
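One way to read that suggestion as code (a rough sketch: `call_model` is a placeholder for whatever sampling API you use, and the perturbation method and agreement threshold are arbitrary assumptions):

```python
import random
from collections import Counter

def call_model(prompt: str, seed: int, temperature: float = 0.7) -> str:
    """Placeholder for an LLM call whose initial state is perturbed via seed/temperature."""
    raise NotImplementedError

def consistent_answer(prompt: str, n_samples: int = 5, min_agreement: float = 0.8):
    # Sample the same prompt several times under different random perturbations
    # and accept the answer only if the runs largely agree. The hypothesis above
    # is that adversarial inputs are less stable under such perturbation.
    answers = [call_model(prompt, seed=random.randrange(2**32)) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    if count / n_samples < min_agreement:
        return None  # treat as suspicious and fall back to a safe path
    return answer
```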
LLMs are already unpredictable in their responses, which adds to the problem: you might test your system against a potential prompt injection three times and see it resist the attack, while an attacker might try another hundred times and have one of their attempts work.
The same is true of people - repeated attempts at social engineering will eventually succeed. We deal with that through a combination of training, segregating responsibilities, involving multiple people in critical decisions, and ultimately treating malicious attempts at fooling people as felonies. The same is needed with LLMs.
In the context of security, it's actually helpful to anthropomorphize LLMs! They are nowhere near human, but they are fundamentally similar enough to have the same risks and failure modes.
With that said, it seems like we need some way for the LLM to identify in-band attacks and point them out to somebody (and not to the attacker, either).
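Something in this shape, perhaps - `detect_injection` is a hypothetical detector (a second model pass, heuristics, or both), and the review queue stands in for whatever out-of-band escalation channel an organization already has:

```python
def detect_injection(text: str) -> bool:
    """Hypothetical detector: a second model pass, heuristics, or both."""
    raise NotImplementedError

def run_llm(text: str) -> str:
    """Placeholder for the normal LLM call."""
    raise NotImplementedError

def handle_request(untrusted_text: str, review_queue: list) -> str:
    # When an in-band attack is suspected, don't explain it to the requester
    # (who may be the attacker); escalate out of band instead.
    if detect_injection(untrusted_text):
        review_queue.append({
            "input": untrusted_text,
            "reason": "suspected prompt injection",
        })
        return "Your request could not be processed."
    return run_llm(untrusted_text)
```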