I’m new to this concept so may have missed something, but the post [0] seems to be about CoT specifically. In CoT you have an intermediary step that helps the model get better final results; the lesson is that if you try to improve the intermediary steps directly using training data then the model will optimize for better steps but not for better final results.
I don’t think this is the same situation. 1. Anthropic is adjusting weights directly to influence the final results, not training against good/bad results and 2. The target is the final result, not an intermediary.
I can see a possible outcome where the model scores low on their sycophancy measure but still acts sycophantic. In that case a new vector might need to be calculated.
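To make the "vector" idea concrete: a toy sketch of how such an adjustment is often done in the interpretability literature (a difference-of-means activation steering vector, subtracted from a hidden state at inference). This is my assumption about the general technique, not Anthropic's actual method; all names, shapes, and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Pretend these are activations collected on contrastive prompt sets:
# one set of sycophantic completions, one of neutral ones.
sycophantic_acts = rng.normal(size=(100, d)) + np.array([2.0] + [0.0] * (d - 1))
neutral_acts = rng.normal(size=(100, d))

# Steering vector: difference of means, normalized to unit length.
v = sycophantic_acts.mean(axis=0) - neutral_acts.mean(axis=0)
v /= np.linalg.norm(v)

def steer(hidden, vec, alpha=2.0):
    """Subtract alpha * vec from a hidden state to suppress the trait."""
    return hidden - alpha * vec

h = rng.normal(size=d) + 2.0 * v      # a state leaning "sycophantic"
h_steered = steer(h, v)

# The steered state projects less onto the sycophancy direction.
print(h @ v > h_steered @ v)           # True
```

The point of the sketch: if the measured direction `v` stops tracking the behavior (the failure mode above), subtracting it no longer helps, and you'd have to re-derive it from fresh contrastive data.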
[0] https://thezvi.substack.com/p/the-most-forbidden-technique/