I’m new to this concept so may have missed something, but the post [0] seems to be about CoT specifically. In CoT you have an intermediary step that helps the model get better final results; the lesson is that if you try to improve the intermediary steps directly using training data then the model will optimize for better steps but not for better final results.
I don’t think this is the same situation. 1. Anthropic is adjusting weights directly to influence the final results, not training against good/bad results and 2. The target is the final result, not an intermediary.
I can see a possible outcome where the model scores low on their sycophancy measure but still acts sycophantic. In that case a new vector might need to be calculated.
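To make the "vector" idea concrete: a toy sketch of how such an adjustment is often done in the interpretability literature (a difference-of-means activation steering vector, subtracted from a hidden state at inference). This is my assumption about the general technique, not Anthropic's actual method; all names, shapes, and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Pretend these are activations collected on contrastive prompt sets:
# one set of sycophantic completions, one of neutral ones.
sycophantic_acts = rng.normal(size=(100, d)) + np.array([2.0] + [0.0] * (d - 1))
neutral_acts = rng.normal(size=(100, d))

# Steering vector: difference of means, normalized to unit length.
v = sycophantic_acts.mean(axis=0) - neutral_acts.mean(axis=0)
v /= np.linalg.norm(v)

def steer(hidden, vec, alpha=2.0):
    """Subtract alpha * vec from a hidden state to suppress the trait."""
    return hidden - alpha * vec

h = rng.normal(size=d) + 2.0 * v      # a state leaning "sycophantic"
h_steered = steer(h, v)

# The steered state projects less onto the sycophancy direction.
print(h @ v > h_steered @ v)           # True
```

The point of the sketch: if the measured direction `v` stops tracking the behavior (the failure mode above), subtracting it no longer helps, and you'd have to re-derive it from fresh contrastive data.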
[0] https://thezvi.substack.com/p/the-most-forbidden-technique/