> and then terminating the conversation (which it did not do)

This is exactly the safeguard.

Terminating the conversation is the only way to go. These things don't have a world model, they don't know what they're doing, and there's no way to correctly assess the situation at the model level. Ending the conversation is the only reliable safeguard, even if a motivated adversary might find jailbreaks to circumvent it.

The problem is that terminating the conversation, even with a closing note to call the crisis line or go talk to a human, is extremely harmful to someone in that situation. To someone who is suicidal and being led deeper into their own delusions, abruptly terminating will feel like abandonment or rejection and can push them further over the edge.

The goal in crisis intervention is to bridge the person to professional help. Never abandon them; keep the conversation going and steer it in a better direction. Ironically, crisis intervention calls for exactly what LLMs are good at: acknowledging what the person in crisis is feeling and showing empathy. The difference is that a responder reframes the situation and holds a firm boundary that the person needs professional help.

Basically, recommending the crisis line and then terminating the conversation won't help; it will make things worse.

The model either needs to behave like a trained crisis responder, or, when certain triggers are hit, a human crisis responder needs to hop onto the other end, continue the conversation, and talk the user down to de-escalate.
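For what it's worth, here's a minimal sketch of what that handoff could look like. Everything in it is hypothetical and invented for illustration: the `risk_score` classifier, the threshold, and the `notify_on_call_responder` pager call don't correspond to any real product or API. The point it shows is the shape of the design: flag, keep the conversation open and warm, and route follow-up messages to a human instead of terminating.

```python
# Hypothetical sketch of a trigger-based human handoff. All names and
# values here are made up for illustration, not any vendor's real system.
from dataclasses import dataclass, field
from typing import List

ESCALATION_THRESHOLD = 0.8  # assumed tuning value, purely illustrative


@dataclass
class Conversation:
    user_id: str
    messages: List[str] = field(default_factory=list)
    escalated: bool = False


def risk_score(message: str) -> float:
    """Placeholder for a real self-harm risk classifier."""
    crisis_markers = ("suicide", "kill myself", "end it")
    return 1.0 if any(m in message.lower() for m in crisis_markers) else 0.0


def handle_user_message(convo: Conversation, message: str) -> str:
    convo.messages.append(message)
    if convo.escalated:
        # Once escalated, the model stays out of it and a human continues.
        return route_to_human(convo, message)
    if risk_score(message) >= ESCALATION_THRESHOLD:
        convo.escalated = True
        notify_on_call_responder(convo)
        # Keep the conversation open while the human connects,
        # rather than terminating and leaving the user alone.
        return ("I hear how much pain you're in. I'm bringing a trained "
                "person into this conversation right now; please stay with me.")
    return model_reply(message)


def model_reply(message: str) -> str:
    return f"(model continues the normal conversation about: {message!r})"


def notify_on_call_responder(convo: Conversation) -> None:
    print(f"[pager] crisis responder requested for user {convo.user_id}")


def route_to_human(convo: Conversation, message: str) -> str:
    return "(human crisis responder replies here and continues de-escalation)"
```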

I'd be in favor of requiring all these AI companies to keep crisis responders on staff to take over conversations when they go off the rails.