But why isn't this merely papering over a more fundamental issue with how these models are "aligned"? LLMs are not inherently sycophantic: Kimi K2 and o3 are not, and Sydney, mentioned in the blog post, most decidedly was not.
In my experience, sycophancy has been present longest in the Anthropic models, so it may be most deeply rooted there. It's only recently, perhaps with the introduction of user A/B preference tests such as those run by lmarena and by the providers themselves, that this has become a major issue for most other LLMs.
Thinking that a simple intervention like adding an "anti-evil" vector to the residual stream will improve behavior sounds naively dangerous. It would not surprise me if it produced unexpected and unwanted downstream effects, which some future paper will no doubt have to address, not unlike what happened with tuning for user preference.
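For context, the intervention being described is mechanically simple, which is part of what worries me. Here is a minimal sketch of residual-stream steering using a PyTorch forward hook; the model, layer index, scale, and the steering vector itself are all placeholders (a random direction instead of a learned one), not whatever Anthropic actually did:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; any decoder-only model works similarly
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    hidden = model.config.hidden_size
    steer_vec = torch.randn(hidden)          # hypothetical "anti-evil" direction
    steer_vec = steer_vec / steer_vec.norm()  # unit-normalize

    def steer(module, inputs, output):
        # GPT-2 blocks return a tuple; index 0 holds the hidden states.
        h = output[0]
        # Add the vector to the residual stream at every position.
        h = h + 4.0 * steer_vec.to(h.dtype)   # 4.0 is an arbitrary scale
        return (h,) + output[1:]

    layer = model.transformer.h[6]  # a mid-depth block, chosen arbitrarily
    handle = layer.register_forward_hook(steer)

    ids = tok("The assistant said:", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(out[0]))
    handle.remove()

Note that the added vector touches every token at that layer, so anything downstream of it shifts too; that's exactly why I'd expect side effects rather than a clean, isolated change in behavior.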