It seems like real robust guardrails would require some sort of "world model" or some other word to describe - AI that understands intent.

Transformers are (to grossly summarize & I don't mean this as an insult) like auto-complete on steroids. So we have cat&mouse guardrails the way swear word filters and Chinese censorship work. People come up with increasingly complex miss-spelling, euphemisms & indirections to get around the filters like saying May 35th.

I suppose one solution would be to completely vet the training data such that nothing deemed "dangerous" exists in the data, which would be a huge effort.

Even this might not work because for example you could ensure no bomb-related data is in the training data, but there's lots of chemistry data adjacent that if probed the right way would allow the LLM to synthesize the answer. Various forms of "how do I store X,Y,Z safely such that nothing bad happens" prompts probably get you on the way.

>I suppose one solution would be to completely vet the training data such that nothing deemed "dangerous" exists in the data, which would be a huge effort.

I can see how this is tempting, but I suspect it would yield a naive model. I think the only way to improve this is to use a model that is legitimately advanced to support the concept of empathy, which may allow it to recognize others as being separate from itself, similar to how toddlers develop this sense (https://blog.lovevery.com/skills-stages/empathy/)