Just like that, in that that separation is internally enforced, by peoples interpretation and understanding, rather than externally enforced in ways that makes it impossible for you to, e.g. believe the e-mail from an unknown address that claims to be from your boss, or be talked into bypassing rules for a customer that is very convincing.
Being fooled into thinking data is instruction isn't the same as being unable to distinguish them in the first place, and being coerced or convinced to bypass rules that are still known to be rules I think remains uniquely human.
> and being coerced or convinced to bypass rules that are still known to be rules I think remains uniquely human.
This is literally what "prompt injection" is. The sooner people understand this, the sooner they'll stop wasting time trying to fix a "bug" that's actually the flip side of the very reason they're using LLMs in the first place.
The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works. Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss. Many defense mechanisms used in corporate email environments are built around making sure the email from your boss looks meaningfully different in order to establish that data vs instruction separation. (There are social engineering attacks that would work in-person though, but I don't think it's right to equate those to LLM attacks.)
Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything.
> The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works.
Yes, that is exactly the point.
> Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss.
Irrelevant, as other attacks works then. E.g. it is never a given that your bosses instructions are consistent with the terms of your employment, for example.
> Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything.
It is very much "convincing", yes. The ability to convince an LLM is what creates the effective lack of separation. Without that, just using "magic" values and a system prompt telling it to ignore everything inside would create separation. But because text anywhere in context can convince the LLM to disregard previous rules, there is no separation.
My parent made a claim that humans have separate pathways for data and instructions and cannot mix them up like LLMs do. Showing that we don't has every effect on refuting their argument.
>>> The principal security problem of LLMs is that there is no architectural boundary between data and control paths.
Just like that, in that that separation is internally enforced, by peoples interpretation and understanding, rather than externally enforced in ways that makes it impossible for you to, e.g. believe the e-mail from an unknown address that claims to be from your boss, or be talked into bypassing rules for a customer that is very convincing.
Being fooled into thinking data is instruction isn't the same as being unable to distinguish them in the first place, and being coerced or convinced to bypass rules that are still known to be rules I think remains uniquely human.
> and being coerced or convinced to bypass rules that are still known to be rules I think remains uniquely human.
This is literally what "prompt injection" is. The sooner people understand this, the sooner they'll stop wasting time trying to fix a "bug" that's actually the flip side of the very reason they're using LLMs in the first place.
This makes no sense to me. Being fooled into thinking data is instruction is exactly evidence of an inability to reliably distinguish them.
And being coerced or convinced to bypass rules is exactly what prompt injection is, and very much not uniquely human any more.
The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works. Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss. Many defense mechanisms used in corporate email environments are built around making sure the email from your boss looks meaningfully different in order to establish that data vs instruction separation. (There are social engineering attacks that would work in-person though, but I don't think it's right to equate those to LLM attacks.)
Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything.
> The email from your boss and the email from a sender masquerading as your boss are both coming through the same channel in the same format with the same presentation, which is why the attack works.
Yes, that is exactly the point.
> Unless you were both faceblind and bad at recognizing voices, the same attack wouldn't work in-person, you'd know the attacker wasn't your boss.
Irrelevant, as other attacks works then. E.g. it is never a given that your bosses instructions are consistent with the terms of your employment, for example.
> Prompt injection is just exploiting the lack of separation, it's not 'coercion' or 'convincing'. Though you could argue that things like jailbreaking are closer to coercion, I'm not convinced that a statistical token predictor can be coerced to do anything.
It is very much "convincing", yes. The ability to convince an LLM is what creates the effective lack of separation. Without that, just using "magic" values and a system prompt telling it to ignore everything inside would create separation. But because text anywhere in context can convince the LLM to disregard previous rules, there is no separation.
the second leads to first, in case you still don't realize
These are different "agents" in LLM terms, they have separate contexts and separate training
There can be outliers, maybe not as frequent :)
If they were 'clearly different' we would not have the concept of the CEO fraud attack:
https://www.barclayscorporate.com/insights/fraud-protection/...
That's an attack because trusted and untrusted input goes through the same human brain input pathways, which can't always tell them apart.
Your parent made no claim about all swans being white. So finding a black swan has no effect on their argument.
My parent made a claim that humans have separate pathways for data and instructions and cannot mix them up like LLMs do. Showing that we don't has every effect on refuting their argument.
>>> The principal security problem of LLMs is that there is no architectural boundary between data and control paths.
>> Exactly like human input to output.
> no nothing like that
but actually yes, exactly like that.