That kind of thing is surprisingly hard to implement. To date I've not seen any provider get caught serving up a fake system prompt... which could mean they're doing it successfully, but I think it's more likely they determined it's not worth it because there are SO MANY ways someone could get the real one, and it would be embarrassing if they were caught trying to fake it.
Tokens are expensive. How much of your system prompt do you want to waste on dumb tricks trying to stop it from leaking?
Probably the only way to do it reliably would be to intercept the prompt with a specially trained classifier? I think you're right that once it gets to the main model, nothing really works.
> That kind of thing is surprisingly hard to implement.
If the response contains the prompt text verbatim (or falls within some distance metric of it), replace the response text.
Not saying it's trivial to implement (and it's probably hard to do in a pure-LLM way), but I don't think it's too hard.
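Something like this sketch is what I mean. The helper names, the 0.85 threshold, and the chunking are all made up for illustration, with Python's difflib standing in for whatever distance metric you'd actually pick:

    # Rough sketch: flag a response if it quotes the system prompt verbatim,
    # or if any chunk of the response looks too similar to the prompt.
    # Threshold and chunking are arbitrary choices for illustration.
    from difflib import SequenceMatcher

    def leaks_system_prompt(response: str, system_prompt: str,
                            threshold: float = 0.85) -> bool:
        # Cheap check first: the prompt copied out word for word.
        if system_prompt in response:
            return True
        # Fuzzy check: slide a prompt-sized window over the response so
        # partial quotes above the similarity threshold also get caught.
        window = len(system_prompt)
        step = max(1, window // 4)
        for start in range(0, max(1, len(response) - window + 1), step):
            chunk = response[start:start + window]
            if SequenceMatcher(None, chunk, system_prompt).ratio() >= threshold:
                return True
        return False

    def filter_response(response: str, system_prompt: str) -> str:
        # Replace the whole response rather than trying to redact in place.
        if leaks_system_prompt(response, system_prompt):
            return "I can't share that."
        return response

Of course this only catches verbatim or near-verbatim reproduction; a model that paraphrases or translates the prompt sails right past it, which is part of why it's harder than it looks.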
More like it's not really a big secret.