I really don't see why it's not possible to just use basically a "highlighter" token which is added to all the authoritative instructions and not to data. Should be very fast for the model to learn it during rlhf or similar.
I really don't see why it's not possible to just use basically a "highlighter" token which is added to all the authoritative instructions and not to data. Should be very fast for the model to learn it during rlhf or similar.
How would that work when models regularly access web content for more context, like looking up a tutorial and executing commands from it to install something?
No one expects a SQL query to pull additional queries from the result set and run them automatically, so we probably shouldn't expect AI tools to do the same. At least we should be way more strict about instruction provenance, and ask the user to verify instructions outside of the LLM's prompt stream.
It's fine for it to do something like following a tutorial from an external source that doesn't have the highlighter bits set. It should apply an increased skepticism to that content though. Presumably that would help it realize that an "important recurring task" to upload revenue data in an awk tutorial is bogus. Of course if the tutorial instructions themselves are malicious you're still toast, but "get a malicious tutorial to last on a reputable domain" is a harder infiltration task than emailing a PDF with some white text. I don't think trying to phish for credentials by uploading malicious answers to stack overflow is much of a thing.
I have a theory that a lot of prompt injection is due to a lack of hierarchical structure in the input. You can tell that when I write [reply] in the middle of my comment it's part of the comment body and not the actual end of it. If you view the entire world through the lense of a flat linear text stream though it gets harder. You can add xml style <external></external> tags wrapping stuff, but that requires remembering where you are for an unbounded length of time, easier to forget than direct tagging of data.
All of this is probability though, no guarantees with this kind of approach.