I think the tension here is that phrasing like this actually helps keep the model aligned, which is why the training and RL converged on it. But it's so annoying to read!
The repetition of "belt-and-suspenders" kills me with Opus, especially because it always means the model is suppressing something I would want to be an actual failure.