Maybe treat prompts like it was SQL strings, they need to be sanitized and preferably never exposed to external dynamic user input

The LLM is basically an iterative function going guess_next_text(entire_document). There is no algorithm-level distinction at all between "system prompt" or "user prompt" or user input... or even between its own prior output. Everything is concatenated into one big equally-untrustworthy stream.

I suspect a lot of techies operate with a subconscious good-faith assumption: "That can't be how X works, nobody would ever built it that way, that would be insecure and naive and error-prone, surely those bajillions of dollars went into a much better architecture."

Alas, when it comes to day's the AI craze, the answer is typically: "Nope, the situation really is that dumb."

__________

P.S.: I would also like to emphasize that even if we somehow color-coded or delineated all text based on origin, that's nowhere close to securing the system. An attacker doesn't need to type $EVIL themselves, they just need to trick the generator into mentioning $EVIL.

There have been attempts like https://arxiv.org/pdf/2410.09102 to do this kind of color-coding but none of them work in a multi-turn context since as you note you can't trust the previous turn's output

Yeah, the functionality+security everyone is dreaming about requires much more than "where did the the words come from." As we keep following the thread of "one more required improvement", I think it'll lead to: "Crap, we need to invent a real AI just to keep the LLM in line."

Even just the first step on the list is a doozy: The LLM has no authorial ego to separate itself from the human user, everything is just The Document. Any entities we perceive are human cognitive illusions, the same way that the "people" we "see" inside a dice-rolled mad-libs story don't really exist.

That's not even beginning to get into things like "I am not You" or "I have goals, You have goals" or "goals can conflict" or "I'm just quoting what You said, saying these words doesn't mean I believe them", etc.

Sanitizing free-form inputs in a natural language is a logistical nightmare, so it's likely there isn't any safe way to do that.

Maybe an LLM should do it.

1st run: check and sanitize

2nd run: give to agent with privileges to do stuff

Problems created by using LLMs generally can't be solved using LLMS.

Your best case scenario is reducing risk by some % but you could also make it less reliable or even open up new attack vectors.

Security issues like these need deterministic solutions, and that's exceedingly difficult (if not impossible) with LLMs.

What stops someone prompt injecting the first LLM into passing unsanitised data to the second?

Now you have 2 vulnerable LLMs. Congratulations.

The problem is there is no real way to separate "data" and "instructions" in LLMs like there is for SQL

There's only one input into the LLM. You can't fix that https://www.linkedin.com/pulse/prompt-injection-visual-prime...

SQL strings can be reliably escaped by well-known mechanical procedures.

There is no generally safe way of escaping LLM input, all you can do is pray, cajole, threaten or hope.

Can’t the connections and APIs that an LLM are given to answer queries be authenticated/authorized by the user entering the query? Then the LLM can’t do anything the asking user can’t do at least. Unless you have launch the icbm permissions yourself there’s no way to get the LLM to actually launch the icbm.

Generally the threat model is that a trusted user is trying to get untrusted data into the system. E.g. you have an email monitor that reads your emails and takes certain actions for you, but that means it's exposed to all your emails which may trick the bot into doing things like forwarding password resets to a hacker.

I think it depends what kind of system and attack we're talking about. For corporate environments this approach absolutely makes sense. But say in a user's personal pc where the LLM can act as them, they have permission to do many things they shouldn't - send passwords to attackers, send money to attackers, rm -rf etc

You cannot sanitize prompt strings.

This is not SQL.