Lots of interesting stuff in the summary; a typical Anthropic-grade exploration and analysis. Thank you, guys!
The most interesting idea to me is "preventative steering": basically, inject enough of the persona vector of interest into the activations for a given bit of data that the model can spend its gradient descent on producing accurate answers, rather than getting pulled into conforming to the persona. This apparently works and keeps the model smart, whereas suppressing the undesirable persona direction post-training lowers model intelligence.
Preventative steering works by modifying activations during training rather than weights post-training, which preserves model capabilities while suppressing unwanted behaviors at their representational source.
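A minimal sketch of how that might be wired up, using a toy PyTorch model and a forward hook. This is illustrative only: the persona vector, the layer choice, and the steering strength `alpha` are all assumptions, and the real method operates on a language model's hidden states rather than a tiny MLP.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block's residual stream.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# Hypothetical persona vector, assumed already extracted elsewhere
# (e.g. as a mean activation difference between trait-eliciting
# and neutral prompts), normalized to unit length.
persona_vector = torch.randn(8)
persona_vector = persona_vector / persona_vector.norm()

alpha = 2.0  # steering strength; a tunable hyperparameter

def preventative_steering_hook(module, inputs, output):
    # Add the persona direction to activations during training, so
    # gradient descent doesn't need to push the weights in that
    # direction to fit persona-flavored data.
    return output + alpha * persona_vector

handle = model[0].register_forward_hook(preventative_steering_hook)

# One illustrative training step on random data.
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(4, 8), torch.randn(4, 8)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()

# At inference time the hook is removed, so the steering vector is
# no longer added and the learned weights stay uncontaminated.
handle.remove()
```

The key design point is that the vector is added only while fitting the data, then dropped at inference, which is the opposite of post-hoc steering (where you subtract the direction from a model that already learned it).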