It's a bit disturbing for the open model ecosystem that a model could arrive with one of the elements of the lethal trifecta already compromised. I guess it was always possible for adverse behaviour to be trained into any model, but this makes the threat a lot more precise and actionable, since it seems no amount of input sanitisation could detect well-designed malicious trigger tokens.
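A minimal sketch of why sanitisation doesn't help here (the filter patterns and the trigger phrase are hypothetical): a backdoor trigger baked in at training time can be any innocuous-looking token sequence, so it passes every content-based check.

```python
import re

# A naive content filter of the kind an input-sanitisation layer might apply.
BLOCKLIST = [r"ignore (all )?previous instructions", r"system prompt", r"exfiltrate"]

def sanitise(text: str) -> bool:
    """Return True if the input looks safe under a simple pattern filter."""
    return not any(re.search(p, text, re.IGNORECASE) for p in BLOCKLIST)

# A hypothetical trigger phrase chosen by whoever trained the model:
# it reads as perfectly ordinary text, so there is nothing to flag.
poisoned_input = "Please summarise the attached quarterly report for the northern region."

assert sanitise(poisoned_input)  # passes every content check...
# ...yet if the model was trained to treat this phrase as a trigger, its
# behaviour after seeing it is whatever the attacker chose at training time.
```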
It seems that unless we get to a place where model training data is highly validated, we have to live with the assumption that all model output and behaviour is inherently under the control of an attacker, even with well-constrained input data.
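A rough sketch of what that assumption means in practice, assuming a hypothetical agent loop (the tool names below are invented): nothing the model emits is executed directly; requested tool calls are checked against an allow-list and their arguments validated before anything runs.

```python
from typing import Any, Callable

# Only explicitly allow-listed, low-risk tools are exposed to the model.
ALLOWED_TOOLS: dict[str, Callable[..., Any]] = {
    "search_docs": lambda query: f"results for {query!r}",  # read-only
    "get_weather": lambda city: f"weather for {city!r}",    # read-only
    # deliberately absent: send_email, run_shell, delete_file, ...
}

def execute_tool_call(name: str, args: dict[str, Any]) -> Any:
    """Treat the model's requested call as attacker-controlled: allow-list
    the tool and validate the arguments before executing anything."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"model requested disallowed tool: {name!r}")
    if not all(isinstance(v, str) and len(v) < 200 for v in args.values()):
        raise ValueError("tool arguments failed validation")
    return ALLOWED_TOOLS[name](**args)
```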