Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."
Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."
Why do you need to grep latent space?
As long as it's giving the right outputs, who cares what's in latent space?
If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?
Additionally, if one of it's latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...
That's a lot of harmless people walking around with crazy thoughts...
Thinking ‘God I wish these people would die’ could increase its propensity to kill all people, even if that propensity is still vanishingly small almost all of the time.
A lot of people are walking around with crazy thoughts. Some of them harm.
Tell me you never had a crazy thought and you are either a lier or a psychopath.
[dead]
[flagged]