Probably the only way to do it reliably would be to intercept the prompt with a specially trained classifier? I think you're right that once it gets to the main model, nothing really works.
Probably the only way to do it reliably would be to intercept the prompt with a specially trained classifier? I think you're right that once it gets to the main model, nothing really works.