"We want to see risks in the models, so no matter how good the performance and alignment, we’ll see risks, results and reality be damned."
"We want to see risks in the models, so no matter how good the performance and alignment, we’ll see risks, results and reality be damned."
i mean, to be fair, these are professional researchers.
i'm very inclined to trust them on the various ways that models can subtly go wrong, in long-term scenarios
for example, consider using models to write email -- is it a misalignment problem if the model is just too good at writing marketing emails?? or too good at getting people to pay a spammy company?
another hot use case: biohacking. if a model is used to do really hardcore synthetic chemistry, one might not realize that the work is potentially harmful until it's too late (i.e., the human is splitting up a problem so that no guardrails are triggered)
"for example, consider using models to write email -- is it a misalignment problem if the model is just too good at writing marketing emails?? or too good at getting people to pay a spammy company?"
But who gets to be the judge of that kind of "misalignment"? Giant tech companies?
Might makes right; brains hold the reins.