What are some good prevention mechanisms for this? A sort of firewall for prompts? I've seen people recommend LLMs, but that seems like it wouldn't work well. What is the industry standard? Or what looks promising at least?
What are some good prevention mechanisms for this? A sort of firewall for prompts? I've seen people recommend LLMs, but that seems like it wouldn't work well. What is the industry standard? Or what looks promising at least?
I have bad news https://matthodges.com/posts/2025-08-26-music-to-break-model...
Nothing yet. Probably a new kind of model needs to be trained that can find injected prompts, sort if like an immune system for LLMs. Then the sanitized data can be passed to the LLM after.
No real solution for it yet. I would be interested to try to train a model for this but no budget atm.
https://simonwillison.net/tags/lethal-trifecta/